Preface


Statistical learning refers to a set of tools for modeling and understanding 
complex datasets. It is a recently developed area in statistics and blends 
with parallel developments in computer science and, in particular, machine 
learning. The field encompasses many methods such as the lasso and sparse 
regression, classification and regression trees, and boosting and support 
vector machines. 

With the explosion of “Big Data” problems, statistical learning has become
a very hot field in many scientific areas as well as marketing, finance,
and other business disciplines. People with statistical learning skills are in
high demand.

One of the first books in this area, The Elements of Statistical Learning
(ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a
second edition in 2009. ESL has become a popular text not only in statistics
but also in related fields. One of the reasons for ESL’s popularity is
its relatively accessible style. But ESL is intended for individuals with
advanced training in the mathematical sciences. An Introduction to Statistical
Learning (ISL) arose from the perceived need for a broader and less technical
treatment of these topics. In this new book, we cover many of the
same topics as ESL, but we concentrate more on the applications of the
methods and less on the mathematical details. We have created labs
illustrating how to implement each of the statistical learning methods using
the popular statistical software package R. These labs provide the reader
with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master’s students
in statistics or related quantitative fields or for individuals in other
disciplines who wish to use statistical learning tools to analyze their data.
It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary
drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick
Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G’Sell, Courtney
Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan,
and Xin Lu Tan.


It’s tough to make predictions, especially about the future. 


Yogi Berra 


Los Angeles, USA 
Seattle, USA 
Palo Alto, USA 
Palo Alto, USA 


Robert Tibshirani 
Introduction


An Overview of Statistical Learning 

Statistical learning refers to a vast set of tools for understanding data. These
tools can be classified as supervised or unsupervised. Broadly speaking,
supervised statistical learning involves building a statistical model for
predicting, or estimating, an output based on one or more inputs. Problems of
this nature occur in fields as diverse as business, medicine, astrophysics, and
public policy. With unsupervised statistical learning, there are inputs but
no supervising output; nevertheless we can learn relationships and structure
from such data. To provide an illustration of some applications of
statistical learning, we briefly discuss three real-world data sets that are
considered in this book.


Wage Data 

In this application (which we refer to as the Wage data set throughout this
book), we examine a number of factors that relate to wages for a group of
males from the Atlantic region of the United States. In particular, we wish
to understand the association of an employee’s age and education, as
well as the calendar year, with his wage. Consider, for example, the left-hand
panel of Figure 1.1, which displays wage versus age for each of the
individuals in the data set. There is evidence that wage increases with age
but then decreases again after approximately age 60. The blue line, which
provides an estimate of the average wage for a given age, makes this trend
clearer. Given an employee’s age, we can use this curve to predict his wage.
However, it is also clear from Figure 1.1 that there is a significant amount
of variability associated with this average value, and so age alone is
unlikely to provide an accurate prediction of a particular man’s wage.

FIGURE 1.1. Wage data, which contains income survey information for males
from the central Atlantic region of the United States. Left: wage as a function of
age. On average, wage increases with age until about 60 years of age, at which
point it begins to decline. Center: wage as a function of year. There is a slow
but steady increase of approximately $10,000 in the average wage between 2003
and 2009. Right: Boxplots displaying wage as a function of education, with 1
indicating the lowest level (no high school diploma) and 5 the highest level (an
advanced graduate degree). On average, wage increases with the level of education.

G. James et al., An Introduction to Statistical Learning: with Applications in R,
Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_1

We also have information regarding each employee’s education level and
the year in which the wage was earned. The center and right-hand panels of
Figure 1.1, which display wage as a function of both year and education,
indicate that both of these factors are associated with wage. Wages increase
by approximately $10,000, in a roughly linear (or straight-line) fashion,
between 2003 and 2009, though this rise is very slight relative to the
variability in the data. Wages are also typically greater for individuals with
higher education levels: men with the lowest education level (1) tend to
have substantially lower wages than those with the highest education level
(5). Clearly, the most accurate prediction of a given man’s wage will be
obtained by combining his age, his education, and the year. In Chapter 3,
we discuss linear regression, which can be used to predict wage from this
data set. Ideally, we should predict wage in a way that accounts for the
non-linear relationship between wage and age. In Chapter 7, we discuss a
class of approaches for addressing this problem.


Stock Market Data 

The Wage data involves predicting a continuous or quantitative output value.
This is often referred to as a regression problem. However, in certain cases
we may instead wish to predict a non-numerical value, that is, a categorical
or qualitative output. For example, in Chapter 4 we examine a stock market
data set that contains the daily movements in the Standard & Poor’s
500 (S&P) stock index over a 5-year period between 2001 and 2005. We
refer to this as the Smarket data. The goal is to predict whether the index
will increase or decrease on a given day using the past 5 days’ percentage
changes in the index. Here the statistical learning problem does not involve
predicting a numerical value. Instead it involves predicting whether
a given day’s stock market performance will fall into the Up bucket or the
Down bucket. This is known as a classification problem. A model that could
accurately predict the direction in which the market will move would be
very useful!

[Figure 1.2 panels: Yesterday; Two Days Previous; Three Days Previous]

FIGURE 1.2. Left: Boxplots of the previous day’s percentage change in the S&P
index for the days for which the market increased or decreased, obtained from the
Smarket data. Center and Right: Same as left panel, but the percentage changes
for 2 and 3 days previous are shown.

The left-hand panel of Figure 1.2 displays two boxplots of the previous
day’s percentage changes in the stock index: one for the 648 days for which
the market increased on the subsequent day, and one for the 602 days for
which the market decreased. The two plots look almost identical, suggesting
that there is no simple strategy for using yesterday’s movement in the
S&P to predict today’s returns. The remaining panels, which display boxplots
for the percentage changes 2 and 3 days previous to today, similarly
indicate little association between past and present returns. Of course, this
lack of pattern is to be expected: in the presence of strong correlations
between successive days’ returns, one could adopt a simple trading strategy
to generate profits from the market. Nevertheless, in Chapter 4, we explore
these data using several different statistical learning methods. Interestingly,
there are hints of some weak trends in the data that suggest that, at least
for this 5-year period, it is possible to correctly predict the direction of
movement in the market approximately 60% of the time (Figure 1.3).

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset
of the Smarket data corresponding to the 2001-2004 time period, and predicted 
the probability of a stock market decrease using the 2005 data. On average, the 
predicted probability of decrease is higher for the days in which the market does 
decrease. Based on these results, we are able to correctly predict the direction of 
movement in the market 60% of the time. 
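
The classification setup described for the Smarket data (predicting Up or Down from the past 5 days' percentage changes) can be sketched as follows. This is only a rough illustration in plain Python, whereas the book's own labs use R. The numbers are invented and are not taken from the Smarket data, the helper name make_examples is hypothetical, and treating a change of exactly zero as Down is an assumption made here for simplicity.

```python
# Hypothetical daily percentage changes in an index. These values are
# invented for illustration and are NOT taken from the Smarket data.
returns = [0.38, -0.62, 1.03, 0.21, -0.19, 0.96, -1.39, 0.40, 0.03, -0.50]


def make_examples(pct_changes, n_lags=5):
    """Pair each day's direction (the output) with the previous
    n_lags percentage changes (the inputs)."""
    examples = []
    for t in range(n_lags, len(pct_changes)):
        # Inputs: the most recent day first (lag 1, lag 2, ..., lag n_lags).
        lags = [pct_changes[t - k] for k in range(1, n_lags + 1)]
        # Output: a qualitative label rather than a number.
        # Assumption: a change of exactly zero is labelled "Down".
        direction = "Up" if pct_changes[t] > 0 else "Down"
        examples.append((lags, direction))
    return examples


examples = make_examples(returns)
for lags, direction in examples:
    print(lags, "->", direction)
```

Each printed line is one observation: five numerical inputs and one categorical output. A classifier would be fit to pairs of this form; no fitting is attempted in this sketch.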


Gene Expression Data 

The previous two applications illustrate data sets with both input and 
output variables. However, another important class of problems involves 
situations in which we only observe input variables, with no corresponding 
output. For example, in a marketing setting, we might have demographic 
information for a number of current or potential customers. We may wish to 
understand which types of customers are similar to each other by grouping 
individuals according to their observed characteristics. This is known as a 
clustering problem. Unlike in the previous examples, here we are not trying 
to predict an output variable. 

We devote Chapter 10 to a discussion of statistical learning methods 
for problems in which no natural output variable is available. We consider 
the NCI60 data set, which consists of 6,830 gene expression measurements 
for each of 64 cancer cell lines. Instead of predicting a particular output 
variable, we are interested in determining whether there are groups, or 
clusters, among the cell lines based on their gene expression measurements. 
This is a difficult question to address, in part because there are thousands 
of gene expression measurements per cell line, making it hard to visualize 
the data. 

The left-hand panel of Figure 1.4 addresses this problem by representing
each of the 64 cell lines using just two numbers, $Z_1$ and $Z_2$. These
are the first two principal components of the data, which summarize the
6,830 expression measurements for each cell line down to two numbers or
dimensions. While it is likely that this dimension reduction has resulted in
some loss of information, it is now possible to visually examine the data for
evidence of clustering. Deciding on the number of clusters is often a difficult
problem. But the left-hand panel of Figure 1.4 suggests at least four
groups of cell lines, which we have represented using separate colors. We
can now examine the cell lines within each cluster for similarities in their
types of cancer, in order to better understand the relationship between
gene expression levels and cancer.

FIGURE 1.4. Left: Representation of the NCI60 gene expression data set in
a two-dimensional space, $Z_1$ and $Z_2$. Each point corresponds to one of the 64
cell lines. There appear to be four groups of cell lines, which we have represented
using different colors. Right: Same as left panel except that we have represented
each of the 14 different types of cancer using a different colored symbol. Cell lines
corresponding to the same cancer type tend to be nearby in the two-dimensional
space.

In this particular data set, it turns out that the cell lines correspond
to 14 different types of cancer. (However, this information was not used
to create the left-hand panel of Figure 1.4.) The right-hand panel of
Figure 1.4 is identical to the left-hand panel, except that the 14 cancer types
are shown using distinct colored symbols. There is clear evidence that cell
lines with the same cancer type tend to be located near each other in this
two-dimensional representation. In addition, even though the cancer
information was not used to produce the left-hand panel, the clustering obtained
does bear some resemblance to some of the actual cancer types observed
in the right-hand panel. This provides some independent verification of the
accuracy of our clustering analysis.


A Brief History of Statistical Learning 

Though the term statistical learning is fairly new, many of the concepts
that underlie the field were developed long ago. At the beginning of the
nineteenth century, Legendre and Gauss published papers on the method
of least squares, which implemented the earliest form of what is now known
as linear regression. The approach was first successfully applied to problems
in astronomy. Linear regression is used for predicting quantitative values,
such as an individual’s salary. In order to predict qualitative values, such as
whether a patient survives or dies, or whether the stock market increases
or decreases, Fisher proposed linear discriminant analysis in 1936. In the
1940s, various authors put forth an alternative approach, logistic regression.
In the early 1970s, Nelder and Wedderburn coined the term generalized
linear models for an entire class of statistical learning methods that include
both linear and logistic regression as special cases.

By the end of the 1970s, many more techniques for learning from data
were available. However, they were almost exclusively linear methods,
because fitting non-linear relationships was computationally infeasible at the
time. By the 1980s, computing technology had finally improved sufficiently
that non-linear methods were no longer computationally prohibitive. In the
mid 1980s, Breiman, Friedman, Olshen, and Stone introduced classification and
regression trees, and were among the first to demonstrate the power of a
detailed practical implementation of a method, including cross-validation
for model selection. Hastie and Tibshirani coined the term generalized
additive models in 1986 for a class of non-linear extensions to generalized
linear models, and also provided a practical software implementation.

Since that time, inspired by the advent of machine learning and other 
disciplines, statistical learning has emerged as a new subfield in statistics, 
focused on supervised and unsupervised modeling and prediction. In recent 
years, progress in statistical learning has been marked by the increasing 
availability of powerful and relatively user-friendly software, such as the 
popular and freely available R system. This has the potential to continue 
the transformation of the field from a set of techniques used and developed 
by statisticians and computer scientists to an essential toolkit for a much 
broader community. 


This Book 

The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and 
Friedman was first published in 2001. Since that time, it has become an 
important reference on the fundamentals of statistical machine learning. 
Its success derives from its comprehensive and detailed treatment of many 
important topics in statistical learning, as well as the fact that (relative to 
many upper-level statistics textbooks) it is accessible to a wide audience. 
However, the greatest factor behind the success of ESL has been its topical 
nature. At the time of its publication, interest in the field of statistical 
learning was starting to explode. ESL provided one of the first accessible
and comprehensive introductions to the topic. 

Since ESL was first published, the field of statistical learning has
continued to flourish. The field’s expansion has taken two forms. The most
obvious growth has involved the development of new and improved statistical
learning approaches aimed at answering a range of scientific questions
across a number of fields. However, the field of statistical learning has
also expanded its audience. In the 1990s, increases in computational power
generated a surge of interest in the field from non-statisticians who were
eager to use cutting-edge statistical tools to analyze their data.
Unfortunately, the highly technical nature of these approaches meant that the
user community remained primarily restricted to experts in statistics, computer
science, and related fields with the training (and time) to understand and
implement them.

In recent years, new and improved software packages have significantly 
eased the implementation burden for many statistical learning methods. 
At the same time, there has been growing recognition across a number of 
fields, from business to health care to genetics to the social sciences and 
beyond, that statistical learning is a powerful tool with important practical 
applications. As a result, the field has moved from one of primarily academic 
interest to a mainstream discipline, with an enormous potential audience. 
This trend will surely continue with the increasing availability of enormous 
quantities of data and the software to analyze it. 

The purpose of An Introduction to Statistical Learning (ISL) is to facilitate
the transition of statistical learning from an academic to a mainstream
field. ISL is not intended to replace ESL, which is a far more comprehensive
text both in terms of the number of approaches considered and the
depth to which they are explored. We consider ESL to be an important
companion for professionals (with graduate degrees in statistics, machine
learning, or related fields) who need to understand the technical details
behind statistical learning approaches. However, the community of users of
statistical learning techniques has expanded to include individuals with a
wider range of interests and backgrounds. Therefore, we believe that there
is now a place for a less technical and more accessible version of ESL.

In teaching these topics over the years, we have discovered that they are
of interest to master’s and PhD students in fields as disparate as business
administration, biology, and computer science, as well as to
quantitatively-oriented upper-division undergraduates. It is important for this
diverse group to be able to understand the models, intuitions, and strengths and
weaknesses of the various approaches. But for this audience, many of the
technical details behind statistical learning methods, such as optimization
algorithms and theoretical properties, are not of primary interest.
We believe that these students do not need a deep understanding of these
aspects in order to become informed users of the various methodologies, and
in order to contribute to their chosen fields through the use of statistical
learning tools.

ISLR is based on the following four premises. 

1. Many statistical learning methods are relevant and useful in a wide
range of academic and non-academic disciplines, beyond just the statistical
sciences. We believe that many contemporary statistical learning
procedures should, and will, become as widely available and used
as is currently the case for classical methods such as linear regression.
As a result, rather than attempting to consider every possible
approach (an impossible task), we have concentrated on presenting
the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No
single approach will perform well in all possible applications. Without
understanding all of the cogs inside the box, or the interaction
between those cogs, it is impossible to select the best box. Hence, we
have attempted to carefully describe the model, intuition, assumptions,
and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it
is not necessary to have the skills to construct the machine inside the
box! Thus, we have minimized discussion of technical details related
to fitting procedures and theoretical properties. We assume that the
reader is comfortable with basic mathematical concepts, but we do
not assume a graduate degree in the mathematical sciences. For instance,
we have almost completely avoided the use of matrix algebra,
and it is possible to understand the entire book without a detailed
knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning
methods to real-world problems. In order to facilitate this, as well
as to motivate the techniques discussed, we have devoted a section
within each chapter to R computer labs. In each lab, we walk the
reader through a realistic application of the methods considered in
that chapter. When we have taught this material in our courses,
we have allocated roughly one-third of classroom time to working
through the labs, and we have found them to be extremely useful.
Many of the less computationally-oriented students who were initially
intimidated by R’s command level interface got the hang of
things over the course of the quarter or semester. We have used R
because it is freely available and is powerful enough to implement all
of the methods discussed in the book. It also has optional packages
that can be downloaded to implement literally thousands of additional
methods. Most importantly, R is the language of choice for
academic statisticians, and new approaches often become available in
R years before they are implemented in commercial packages. However,
the labs in ISL are self-contained, and can be skipped if the
reader wishes to use a different software package or does not wish to
apply the methods discussed to real-world problems.


Who Should Read This Book? 

This book is intended for anyone who is interested in using modern statistical
methods for modeling and prediction from data. This group includes
scientists, engineers, data analysts, or quants, but also less technical
individuals with degrees in non-quantitative fields such as the social sciences
or business. We expect that the reader will have had at least one elementary
course in statistics. Background in linear regression is also useful, though
not required, since we review the key concepts behind linear regression in
Chapter 3. The mathematical level of this book is modest, and a detailed
knowledge of matrix operations is not required. This book provides an
introduction to the statistical programming language R. Previous exposure
to a programming language, such as MATLAB or Python, is useful but not
required.

We have successfully taught material at this level to master’s and PhD 
students in business, computer science, biology, earth sciences, psychology, 
and many other areas of the physical and social sciences. This book could 
also be appropriate for advanced undergraduates who have already taken 
a course on linear regression. In the context of a more mathematically 
rigorous course in which ESL serves as the primary textbook, ISL could 
be used as a supplementary text for teaching computational aspects of the 
various approaches. 


Notation and Simple Matrix Algebra 

Choosing notation for a textbook is always a difficult task. For the most 
part we adopt the same notational conventions as ESL. 

We will use n to represent the number of distinct data points, or observations,
in our sample. We will let p denote the number of variables that are
available for use in making predictions. For example, the Wage data set
consists of 12 variables for 3,000 people, so we have n = 3,000 observations and
p = 12 variables (such as year, age, wage, and more). Note that throughout
this book, we indicate variable names using colored font: Variable Name.

In some examples, p might be quite large, such as on the order of thousands
or even millions; this situation arises quite often, for example, in the
analysis of modern biological data or web-based advertising data.

In general, we will let $x_{ij}$ represent the value of the jth variable for the
ith observation, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. Throughout this
book, i will be used to index the samples or observations (from 1 to n) and
j will be used to index the variables (from 1 to p). We let $\mathbf{X}$ denote an n × p
matrix whose (i, j)th element is $x_{ij}$. That is,

$$
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.
$$


For readers who are unfamiliar with matrices, it is useful to visualize X as 
a spreadsheet of numbers with n rows and p columns. 
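
To make the spreadsheet picture concrete, here is a minimal sketch in plain Python (the book's own labs use R; Python is used here only for illustration). The values, the column names, and the helper x are hypothetical and are not taken from the Wage data.

```python
# Hypothetical data: n = 4 observations (rows) on p = 3 variables (columns).
# Column names and values are invented for illustration only.
variable_names = ["year", "age", "wage"]
X = [
    [2006, 18, 75.0],
    [2004, 24, 70.5],
    [2003, 45, 131.0],
    [2005, 43, 154.7],
]

n = len(X)      # number of observations (rows)
p = len(X[0])   # number of variables (columns)


def x(i, j):
    """Return x_ij using the text's 1-based indices.

    Python lists are 0-based, so both indices are shifted by one.
    """
    return X[i - 1][j - 1]


print(n, p)      # 4 3
print(x(3, 2))   # 45: the 2nd variable (age) for the 3rd observation
```

The only design choice worth noting is the index shift inside x: the text counts observations and variables from 1, while Python counts from 0.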

At times we will be interested in the rows of $\mathbf{X}$, which we write as
$x_1, x_2, \ldots, x_n$. Here $x_i$ is a vector of length p, containing the p variable
measurements for the ith observation. That is,

$$
x_i =
\begin{pmatrix}
x_{i1} \\
x_{i2} \\
\vdots \\
x_{ip}
\end{pmatrix}.
\tag{1.1}
$$


(Vectors are by default represented as columns.) For example, for the Wage
data, $x_i$ is a vector of length 12, consisting of year, age, wage, and other
values for the ith individual. At other times we will instead be interested
in the columns of $\mathbf{X}$, which we write as $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$.
Each is a vector of length n. That is,

$$
\mathbf{x}_j =
\begin{pmatrix}
x_{1j} \\
x_{2j} \\
\vdots \\
x_{nj}
\end{pmatrix}.
$$


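For example, for the Wage data, $\mathbf{x}_1$ contains the n = 3,000 values for year.
Using this notation, the matrix $\mathbf{X}$ can be written as

$$
\mathbf{X} =
\begin{pmatrix}
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p
\end{pmatrix},
$$

or

$$
\mathbf{X} =
\begin{pmatrix}
x_1^T \\
x_2^T \\
\vdots \\
x_n^T
\end{pmatrix}.
$$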
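
The $^T$ notation denotes the transpose of a matrix or vector. So, for example,

$$
\mathbf{X}^T =
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix},
$$

while

$$
x_i^T =
\begin{pmatrix}
x_{i1} & x_{i2} & \cdots & x_{ip}
\end{pmatrix}.
$$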


We use $y_i$ to denote the ith observation of the variable on which we
wish to make predictions, such as wage. Hence, we write the set of all n
observations in vector form as

$$
\mathbf{y} =
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix}.
$$

Then our observed data consists of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where
each $x_i$ is a vector of length p. (If p = 1, then $x_i$ is simply a scalar.)
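
The row, column, and transpose notation above can be illustrated with a short sketch in plain Python (again, the book's labs use R). The data values and the helper names row, column, and transpose are hypothetical, chosen only to mirror the notation $x_i$, $\mathbf{x}_j$, and $\mathbf{X}^T$.

```python
# Hypothetical inputs: n = 3 observations on p = 2 variables (year, age),
# plus one output value per observation. Values are invented.
X = [
    [2006, 18],
    [2004, 24],
    [2003, 45],
]
y = [75.0, 70.5, 131.0]
n, p = len(X), len(X[0])


def row(X, i):
    """x_i: the p measurements for the ith observation (1-based i)."""
    return list(X[i - 1])


def column(X, j):
    """x_j (bold in the text): the n values of the jth variable (1-based j)."""
    return [r[j - 1] for r in X]


def transpose(X):
    """X^T: rows become columns, so an n x p matrix becomes p x n."""
    return [list(col) for col in zip(*X)]


XT = transpose(X)

# The observed data as pairs (x_i, y_i) for i = 1, ..., n.
data = [(row(X, i), y[i - 1]) for i in range(1, n + 1)]

print(row(X, 2))     # [2004, 24]
print(column(X, 1))  # [2006, 2004, 2003]
print(XT)            # [[2006, 2004, 2003], [18, 24, 45]]
print(data[2])       # ([2003, 45], 131.0)
```

Note that the jth row of the transpose equals the jth column of the original matrix, which is what the displayed form of $\mathbf{X}^T$ expresses.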

In this text, a vector of length n will always be denoted in lower case
bold; e.g.

$$
\mathbf{a} =
\begin{pmatrix}
a_1 \\
a_2 \\
\vdots \\
a_n
\end{pmatrix}.
$$

However, vectors that are not of length n (such as feature vectors of length
p, as in (1.1)) will be denoted in lower case normal font, e.g. $a$. Scalars will
also be denoted in lower case normal font, e.g. $a$. In the rare cases in which
these two uses for lower case normal font lead to ambiguity, we will clarify
which use is intended. Matrices will be denoted using bold capitals, such
as $\mathbf{A}$. Random variables will be denoted using capital normal font, e.g. $A$,
regardless of their dimensions.

Occasionally we will want to indicate the dimension of a particular object.
To indicate that an object is a scalar, we will use the notation $a \in \mathbb{R}$.
To indicate that it is a vector of length k, we will use $a \in \mathbb{R}^k$ (or
$\mathbf{a} \in \mathbb{R}^n$ if it is of length n). We will indicate that an object is an
r × s matrix using $\mathbf{A} \in \mathbb{R}^{r \times s}$.

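We have avoided using matrix algebra whenever possible. However, in
a few instances it becomes too cumbersome to avoid it entirely. In these
rare instances it is important to understand the concept of multiplying
two matrices. Suppose that $\mathbf{A} \in \mathbb{R}^{r \times d}$ and $\mathbf{B} \in \mathbb{R}^{d \times s}$. Then the product
of $\mathbf{A}$ and $\mathbf{B}$ is denoted $\mathbf{AB}$. The (i, j)th element of $\mathbf{AB}$ is computed by
multiplying each element of the ith row of $\mathbf{A}$ by the corresponding element
of the jth column of $\mathbf{B}$ and summing the results. That is,
$(\mathbf{AB})_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$. As an example, consider

$$
\mathbf{A} =
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}
\quad \text{and} \quad
\mathbf{B} =
\begin{pmatrix}
5 & 6 \\
7 & 8
\end{pmatrix}.
$$

Then

$$
\mathbf{AB} =
\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}
\begin{pmatrix}
5 & 6 \\
7 & 8
\end{pmatrix}
=
\begin{pmatrix}
1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\
3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8
\end{pmatrix}
=
\begin{pmatrix}
19 & 22 \\
43 & 50
\end{pmatrix}.
$$

Note that this operation produces an r × s matrix. It is only possible to
compute $\mathbf{AB}$ if the number of columns of $\mathbf{A}$ is the same as the number of
rows of $\mathbf{B}$.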
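
The multiplication rule above can be checked with a short sketch in plain Python (the book's labs use R, where the built-in operator %*% performs this computation). The function name matmul is a hypothetical helper written for this illustration; it implements the sum over k directly and refuses to multiply when the dimensions do not match.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B (lists of rows).

    The (i, j)th entry of the result is the sum over k of a_ik * b_kj.
    """
    r, d, s = len(A), len(B), len(B[0])
    # AB is defined only if every row of A has as many entries
    # as B has rows.
    if any(len(a_row) != d for a_row in A):
        raise ValueError("columns of A must equal rows of B")
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

AB = matmul(A, B)
print(AB)  # [[19, 22], [43, 50]]
```

The result agrees with the worked example: a 2 × 2 matrix times a 2 × 2 matrix gives a 2 × 2 matrix with entries 19, 22, 43, and 50.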


Organization of This Book 

Chapter 2 introduces the basic terminology and concepts behind statistical
learning. This chapter also presents the K-nearest neighbor classifier, a
very simple method that works surprisingly well on many problems.
Chapters 3 and 4 cover classical linear methods for regression and classification.
In particular, Chapter 3 reviews linear regression, the fundamental starting
point for all regression methods. In Chapter 4 we discuss two of the
most important classical classification methods, logistic regression and
linear discriminant analysis.

A central problem in all statistical learning situations involves choosing
the best method for a given application. Hence, in Chapter 5 we introduce
cross-validation and the bootstrap, which can be used to estimate the
accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on
non-linear methods. However, linear methods often have advantages over
their non-linear competitors in terms of interpretability and sometimes also
accuracy. Hence, in Chapter 6 we consider a host of linear methods, both
classical and more modern, which offer potential improvements over standard
linear regression. These include stepwise selection, ridge regression,
principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical
learning. We first introduce in Chapter 7 a number of non-linear methods
that work well for problems with a single input variable. We then show how
these methods can be used to fit non-linear additive models for which there
is more than one input. In Chapter 8, we investigate tree-based methods,
including bagging, boosting, and random forests. Support vector machines,
a set of approaches for performing both linear and non-linear classification,
are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting
in which we have input variables but no output variable. In particular, we
present principal components analysis, K-means clustering, and hierarchical
clustering.

At the end of each chapter, we present one or more R lab sections in
which we systematically work through applications of the various methods
discussed in that chapter. These labs demonstrate the strengths and
weaknesses of the various approaches, and also provide a useful reference
for the syntax required to implement the various methods. The reader may
choose to work through the labs at his or her own pace, or the labs may
be the focus of group sessions as part of a classroom environment. Within
each R lab, we present the results that we obtained when we performed
the lab at the time of writing this book. However, new versions of R are
continuously released, and over time, the packages called in the labs will be
updated. Therefore, the results shown in the lab sections may in the future
no longer correspond precisely to the results obtained by the reader who
performs the labs. As necessary, we will post updates to the labs on the
book website.

We use the & symbol to denote sections or exercises that contain more 
challenging concepts. These can be easily skipped by readers who do not 
wish to delve as deeply into the material, or who lack the mathematical 
background. 


Data Sets Used in Labs and Exercises 

In this textbook, we illustrate statistical learning methods using applications
from marketing, finance, biology, and other areas. The ISLR package
available on the book website contains a number of data sets that are
required in order to perform the labs and exercises associated with this
book. One other data set is contained in the MASS library, and yet another
is part of the base R distribution. Table 1.1 contains a summary of the data
sets required to perform the labs and exercises. A couple of these data sets
are also available as text files on the book website, for use in Chapter 2.


Book Website 

The website for this book is located at


www.StatLearning.com


It contains a number of resources, including the R package associated with
this book, and some additional data sets.


Name        Description

Auto        Gas mileage, horsepower, and other information for cars.
Boston      Housing values and other information about Boston suburbs.
Caravan     Information about individuals offered caravan insurance.
Carseats    Information about car seat sales in 400 stores.
College     Demographic characteristics, tuition, and more for USA colleges.
Default     Customer default records for a credit card company.
Hitters     Records and salaries for baseball players.
Khan        Gene expression measurements for four cancer types.
NCI60       Gene expression measurements for 64 cancer cell lines.
OJ          Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio   Past values of financial assets, for use in portfolio allocation.
Smarket     Daily percentage returns for S&P 500 over a 5-year period.
USArrests   Crime statistics per 100,000 residents in 50 states of USA.
Wage        Income survey data for males in central Atlantic region of USA.
Weekly      1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this
textbook. All data sets are available in the ISLR library, with the exception of
Boston (part of MASS) and USArrests (part of the base R distribution).


Statistical Learning


2.1 What Is Statistical Learning? 


In order to motivate our study of statistical learning, we begin with a 
simple example. Suppose that we are statistical consultants hired by a 
client to provide advice on how to improve sales of a particular product. The 
Advertising data set consists of the sales of that product in 200 different 
markets, along with advertising budgets for the product in each of those 
markets for three different media: TV, radio, and newspaper. The data are 
displayed in Figure 2.1. It is not possible for our client to directly increase 
sales of the product. On the other hand, they can control the advertising 
expenditure in each of the three media. Therefore, if we determine that 
there is an association between advertising and sales, then we can instruct 
our client to adjust advertising budgets, thereby indirectly increasing sales. 
In other words, our goal is to develop an accurate model that can be used 
to predict sales on the basis of the three media budgets. 

In this setting, the advertising budgets are input variables while sales 
is an output variable. The input variables are typically denoted using the 
symbol X, with a subscript to distinguish them. So $X_1$ might be the TV
budget, $X_2$ the radio budget, and $X_3$ the newspaper budget. The inputs
go by different names, such as predictors, independent variables, features,
or sometimes just variables. The output variable—in this case, sales— is 
often called the response or dependent variable, and is typically denoted 
using the symbol Y. Throughout this book, we will use all of these terms 
interchangeably. 


FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands
of units, as a function of TV, radio, and newspaper budgets, in thousands of
dollars, for 200 different markets. In each plot we show the simple least squares
fit of sales to that variable, as described in Chapter 3. In other words, each blue
line represents a simple model that can be used to predict sales using TV, radio,
and newspaper, respectively.


More generally, suppose that we observe a quantitative response Y and p
different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some
relationship between Y and $X = (X_1, X_2, \ldots, X_p)$, which can be written
in the very general form

$$Y = f(X) + \epsilon. \tag{2.1}$$

Here f is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random
error term, which is independent of X and has mean zero. In this formulation,
f represents the systematic information that X provides about Y.

As another example, consider the left-hand panel of Figure 2.2, a plot of 
income versus years of education for 30 individuals in the Income data set. 
The plot suggests that one might be able to predict income using years of 
education. However, the function f that connects the input variable to the
output variable is in general unknown. In this situation one must estimate
f based on the observed points. Since Income is a simulated data set, f is
known and is shown by the blue curve in the right-hand panel of Figure 2.2.
The vertical lines represent the error terms $\epsilon$. We note that some of the
30 observations lie above the blue curve and some lie below it; overall, the 
errors have approximately mean zero. 
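
The setup behind a simulated data set of this kind can be sketched in a few lines. The block below is only an illustration, not the procedure used to create the Income data: the function `f`, the noise level, and the range of education values are all hypothetical choices. It draws 30 observations from the model Y = f(X) + error with mean-zero errors, then reports the average error and how many points fall above the curve.

```python
import random
import statistics

def f(x):
    # Hypothetical "true" function, chosen for illustration only.
    return 20.0 + 3.0 * x

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30

# Inputs (years of education) and mean-zero random errors.
education = [rng.uniform(10, 22) for _ in range(n)]
errors = [rng.gauss(0.0, 5.0) for _ in range(n)]

# Each response is the systematic part f(x) plus its error term.
income = [f(x) + e for x, e in zip(education, errors)]

mean_error = statistics.mean(errors)
n_above = sum(e > 0 for e in errors)
print(f"mean error: {mean_error:.2f}")
print(f"{n_above} observations above the curve, {n - n_above} below")
```

Because the errors are drawn with mean zero, their sample average should be small relative to their spread, although it will not be exactly zero in a sample of 30.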

In general, the function f may involve more than one input variable.
In Figure 2.3 we plot income as a function of years of education and
seniority. Here f is a two-dimensional surface that must be estimated
based on the observed data.


FIGURE 2.2. The Income data set. Left: The red dots are the observed values
of income (in tens of thousands of dollars) and years of education for 30 individuals.
Right: The blue curve represents the true underlying relationship between
income and years of education, which is generally unknown (but is known in
this case because the data were simulated). The black lines represent the error
associated with each observation. Note that some errors are positive (if an observation
lies above the blue curve) and some are negative (if an observation lies
below the curve). Overall, these errors have approximately mean zero.


In essence, statistical learning refers to a set of approaches for estimating
f. In this chapter we outline some of the key theoretical concepts that arise
in estimating f, as well as tools for evaluating the estimates obtained.


2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction
and inference. We discuss each in turn.


Prediction 

In many situations, a set of inputs X are readily available, but the output 
Y cannot be easily obtained. In this setting, since the error term averages 
to zero, we can predict Y using 


$$\hat{Y} = \hat{f}(X), \tag{2.2}$$

where $\hat{f}$ represents our estimate for f, and $\hat{Y}$ represents the resulting
prediction for Y. In this setting, $\hat{f}$ is often treated as a black box, in the sense
that one is not typically concerned with the exact form of $\hat{f}$, provided that
it yields accurate predictions for Y.

FIGURE 2.3. The plot displays income as a function of years of education
and seniority in the Income data set. The blue surface represents the true underlying
relationship between income and years of education and seniority,
which is known since the data are simulated. The red dots indicate the observed
values of these quantities for 30 individuals.


As an example, suppose that $X_1, \ldots, X_p$ are characteristics of a patient’s
blood sample that can be easily measured in a lab, and Y is a variable 
encoding the patient’s risk for a severe adverse reaction to a particular 
drug. It is natural to seek to predict Y using X , since we can then avoid 
giving the drug in question to patients who are at high risk of an adverse 
reaction—that is, patients for whom the estimate of Y is high. 

The accuracy of $\hat{Y}$ as a prediction for Y depends on two quantities,
which we will call the reducible error and the irreducible error. In general,
$\hat{f}$ will not be a perfect estimate for f, and this inaccuracy will introduce
some error. This error is reducible because we can potentially improve the
accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to
estimate f. However, even if it were possible to form a perfect estimate for
f, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction
would still have some error in it! This is because Y is also a function of
$\epsilon$, which, by definition, cannot be predicted using X. Therefore, variability
associated with $\epsilon$ also affects the accuracy of our predictions. This is known
as the irreducible error, because no matter how well we estimate f, we
cannot reduce the error introduced by $\epsilon$.

Why is the irreducible error larger than zero? The quantity $\epsilon$ may contain
unmeasured variables that are useful in predicting Y: since we don’t
measure them, f cannot use them for its prediction. The quantity $\epsilon$ may
also contain unmeasurable variation. For example, the risk of an adverse
reaction might vary for a given patient on a given day, depending on
manufacturing variation in the drug itself or the patient’s general feeling
of well-being on that day.

Consider a given estimate $\hat{f}$ and a set of predictors X, which yields the
prediction $\hat{Y} = \hat{f}(X)$. Assume for a moment that both $\hat{f}$ and X are fixed.
Then, it is easy to show that

$$
\begin{aligned}
E(Y - \hat{Y})^2 &= E[f(X) + \epsilon - \hat{f}(X)]^2 \\
&= \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}},
\end{aligned}
\tag{2.3}
$$

where $E(Y - \hat{Y})^2$ represents the average, or expected value, of the squared
difference between the predicted and actual value of Y, and $\mathrm{Var}(\epsilon)$ represents
the variance associated with the error term $\epsilon$.
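
The step between the two lines of (2.3) can be spelled out as follows. This is a sketch under the assumptions already stated in the text: the estimate and X are held fixed, so the error term is the only random quantity, and it has mean zero.

```latex
\begin{aligned}
E(Y - \hat{Y})^2
  &= E\big[(f(X) - \hat{f}(X)) + \epsilon\big]^2 \\
  &= [f(X) - \hat{f}(X)]^2
     + 2\,[f(X) - \hat{f}(X)]\,E(\epsilon)
     + E(\epsilon^2) \\
  &= [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon).
\end{aligned}
```

The cross term vanishes because $E(\epsilon) = 0$, and for the same reason $E(\epsilon^2) = \mathrm{Var}(\epsilon)$.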

The focus of this book is on techniques for estimating f with the aim of
minimizing the reducible error. It is important to keep in mind that the 
irreducible error will always provide an upper bound on the accuracy of 
our prediction for Y. This bound is almost always unknown in practice. 

Inference 

We are often interested in understanding the way that Y is affected as
$X_1, \ldots, X_p$ change. In this situation we wish to estimate f, but our goal is
not necessarily to make predictions for Y. We instead want to understand
the relationship between X and Y, or more specifically, to understand how
Y changes as a function of $X_1, \ldots, X_p$. Now $\hat{f}$ cannot be treated as a black
box, because we need to know its exact form. In this setting, one may be
interested in answering the following questions:

• Which predictors are associated with the response? It is often the case 
that only a small fraction of the available predictors are substantially 
associated with Y. Identifying the few important predictors among a 
large set of possible variables can be extremely useful, depending on 
the application. 

• What is the relationship between the response and each predictor? 
Some predictors may have a positive relationship with Y, in the sense 
that increasing the predictor is associated with increasing values of 
Y. Other predictors may have the opposite relationship. Depending 
on the complexity of f, the relationship between the response and a
given predictor may also depend on the values of the other predictors. 

• Can the relationship between Y and each predictor be adequately summarized
using a linear equation, or is the relationship more complicated?
Historically, most methods for estimating f have taken a linear
form. In some situations, such an assumption is reasonable or even desirable.
But often the true relationship is more complicated, in which
case a linear model may not provide an accurate representation of 
the relationship between the input and output variables. 


In this book, we will see a number of examples that fall into the prediction
setting, the inference setting, or a combination of the two. 

For instance, consider a company that is interested in conducting a 
direct-marketing campaign. The goal is to identify individuals who will 
respond positively to a mailing, based on observations of demographic variables
measured on each individual. In this case, the demographic variables
serve as predictors, and response to the marketing campaign (either positive
or negative) serves as the outcome. The company is not interested
in obtaining a deep understanding of the relationships between each individual
predictor and the response; instead, the company simply wants
an accurate model to predict the response using the predictors. This is an 
example of modeling for prediction. 

In contrast, consider the Advertising data illustrated in Figure 2.1. One 
may be interested in answering questions such as: 

- Which media contribute to sales?

- Which media generate the biggest boost in sales? or

- How much increase in sales is associated with a given increase in TV
advertising?

This situation falls into the inference paradigm. Another example involves 
modeling the brand of a product that a customer might purchase based on 
variables such as price, store location, discount levels, competition price, 
and so forth. In this situation one might really be most interested in how 
each of the individual variables affects the probability of purchase. For 
instance, what effect will changing the price of a product have on sales? 
This is an example of modeling for inference. 

Finally, some modeling could be conducted both for prediction and inference.
For example, in a real estate setting, one may seek to relate values of
homes to inputs such as crime rate, zoning, distance from a river, air quality,
schools, income level of community, size of houses, and so forth. In this
case one might be interested in how the individual input variables affect 
the prices—that is, how much extra will a house be worth if it has a view 
of the river? This is an inference problem. Alternatively, one may simply 
be interested in predicting the value of a home given its characteristics: is 
this house under- or over-valued? This is a prediction problem. 

Depending on whether our ultimate goal is prediction, inference, or a
combination of the two, different methods for estimating f may be appropriate.
For example, linear models allow for relatively simple and interpretable
inference, but may not yield as accurate predictions as some other
approaches. In contrast, some of the highly non-linear approaches that we
discuss in the later chapters of this book can potentially provide quite accurate
predictions for Y, but this comes at the expense of a less interpretable
model for which inference is more challenging.


2.1.2 How Do We Estimate f?

Throughout this book, we explore many linear and non-linear approaches
for estimating f. However, these methods generally share certain characteristics.
We provide an overview of these shared characteristics in this
section. We will always assume that we have observed a set of n different
data points. For example in Figure 2.2 we observed n = 30 data points.
These observations are called the training data because we will use these
observations to train, or teach, our method how to estimate f. Let $x_{ij}$
represent the value of the jth predictor, or input, for observation i, where
$i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. Correspondingly, let $y_i$ represent the
response variable for the ith observation. Then our training data consist of
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$.

Our goal is to apply a statistical learning method to the training data
in order to estimate the unknown function f. In other words, we want to
find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation (X, Y). Broadly
speaking, most statistical learning methods for this task can be characterized
as either parametric or non-parametric. We now briefly discuss these
two types of approaches.

Parametric Methods 

Parametric methods involve a two-step model-based approach. 

1. First, we make an assumption about the functional form, or shape,
of f. For example, one very simple assumption is that f is linear in
X:

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p. \tag{2.4}$$

This is a linear model, which will be discussed extensively in Chapter 3.
Once we have assumed that f is linear, the problem of estimating
f is greatly simplified. Instead of having to estimate an entirely
arbitrary p-dimensional function f(X), one only needs to estimate
the p + 1 coefficients $\beta_0, \beta_1, \ldots, \beta_p$.

2. After a model has been selected, we need a procedure that uses the
training data to fit or train the model. In the case of the linear model
(2.4), we need to estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$. That is, we
want to find values of these parameters such that

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.$$

The most common approach to fitting the model (2.4) is referred
to as (ordinary) least squares, which we discuss in Chapter 3. However,
least squares is one of many possible ways to fit the linear
model. In Chapter 6, we discuss other approaches for estimating the
parameters in (2.4).
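
The two steps above can be sketched for the simplest case of a single predictor. The block is a minimal illustration under stated assumptions, not the book's lab code: the function name `fit_simple_least_squares`, the simulated training data, and the coefficient values 2.0 and 0.5 are all hypothetical. It uses the standard closed-form least squares solution for a line.

```python
import random

def fit_simple_least_squares(x, y):
    """Return (beta0, beta1) minimizing the sum of squared residuals
    for the one-predictor linear model y = beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Step 1: assume the functional form f(X) = beta0 + beta1 * X.
# Step 2: use training data to estimate the two coefficients.
rng = random.Random(1)
x_train = [rng.uniform(0, 10) for _ in range(200)]
y_train = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x_train]  # hypothetical truth

beta0_hat, beta1_hat = fit_simple_least_squares(x_train, y_train)

def f_hat(x_new):
    # The fitted model: the whole estimate of f is carried by two numbers.
    return beta0_hat + beta1_hat * x_new

print(f"estimated intercept {beta0_hat:.2f}, estimated slope {beta1_hat:.2f}")
```

The point of the sketch is that once a linear form is assumed, estimating f means estimating p + 1 numbers (here two) and nothing else.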

The model-based approach just described is referred to as parametric;
it reduces the problem of estimating f down to one of estimating a set of
parameters. Assuming a parametric form for f simplifies the problem of
estimating f because it is generally much easier to estimate a set of parameters,
such as $\beta_0, \beta_1, \ldots, \beta_p$ in the linear model (2.4), than it is to fit
an entirely arbitrary function f. The potential disadvantage of a parametric
approach is that the model we choose will usually not match the true
unknown form of f. If the chosen model is too far from the true f, then
our estimate will be poor. We can try to address this problem by choosing
flexible models that can fit many different possible functional forms
for f. But in general, fitting a more flexible model requires estimating a
greater number of parameters. These more complex models can lead to a
phenomenon known as overfitting the data, which essentially means they
follow the errors, or noise, too closely. These issues are discussed throughout
this book.

FIGURE 2.4. A linear model fit by least squares to the Income data from Figure 2.3.
The observations are shown in red, and the yellow plane indicates the
least squares fit to the data.

Figure 2.4 shows an example of the parametric approach applied to the
Income data from Figure 2.3. We have fit a linear model of the form

$$\text{income} \approx \beta_0 + \beta_1 \times \text{education} + \beta_2 \times \text{seniority}.$$

Since we have assumed a linear relationship between the response and the
two predictors, the entire fitting problem reduces to estimating $\beta_0$, $\beta_1$, and
$\beta_2$, which we do using least squares linear regression. Comparing Figure 2.3
to Figure 2.4, we can see that the linear fit given in Figure 2.4 is not quite
right: the true f has some curvature that is not captured in the linear fit.
However, the linear fit still appears to do a reasonable job of capturing the
positive relationship between years of education and income, as well as the
slightly less positive relationship between seniority and income. It may be
that with such a small number of observations, this is the best we can do.

FIGURE 2.5. A smooth thin-plate spline fit to the Income data from Figure 2.3
is shown in yellow; the observations are displayed in red. Splines are discussed in
Chapter 7.

Non-parametric Methods 

Non-parametric methods do not make explicit assumptions about the functional
form of f. Instead they seek an estimate of f that gets as close to the
data points as possible without being too rough or wiggly. Such approaches
can have a major advantage over parametric approaches: by avoiding the
assumption of a particular functional form for f, they have the potential
to accurately fit a wider range of possible shapes for f. Any parametric
approach brings with it the possibility that the functional form used to
estimate f is very different from the true f, in which case the resulting
model will not fit the data well. In contrast, non-parametric approaches
completely avoid this danger, since essentially no assumption about the
form of f is made. But non-parametric approaches do suffer from a major
disadvantage: since they do not reduce the problem of estimating f to a
small number of parameters, a very large number of observations (far more
than is typically needed for a parametric approach) is required in order to
obtain an accurate estimate for f.

An example of a non-parametric approach to fitting the Income data is
shown in Figure 2.5. A thin-plate spline is used to estimate f. This approach
does not impose any pre-specified model on f. It instead attempts
to produce an estimate for f that is as close as possible to the observed
data, subject to the fit (that is, the yellow surface in Figure 2.5) being
smooth. In this case, the non-parametric fit has produced a remarkably accurate
estimate of the true f shown in Figure 2.3. In order to fit a thin-plate
spline, the data analyst must select a level of smoothness. Figure 2.6 shows
the same thin-plate spline fit using a lower level of smoothness, allowing
for a rougher fit. The resulting estimate fits the observed data perfectly!
However, the spline fit shown in Figure 2.6 is far more variable than the
true function f, from Figure 2.3. This is an example of overfitting the
data, which we discussed previously. It is an undesirable situation because
the fit obtained will not yield accurate estimates of the response on new
observations that were not part of the original training data set. We discuss
methods for choosing the correct amount of smoothness in Chapter 5.
Splines are discussed in Chapter 7.

FIGURE 2.6. A rough thin-plate spline fit to the Income data from Figure 2.3.
This fit makes zero errors on the training data.
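
The effect of the smoothness level can be illustrated with a much simpler non-parametric method than a thin-plate spline. The sketch below swaps in k-nearest-neighbor averaging in one dimension, where the number of neighbors k plays the role of the smoothness setting; the true function, noise level, and sample sizes are hypothetical. With k = 1 the fit reproduces every training response exactly, which mirrors the zero training error of the rough fit described above.

```python
import math
import random

def true_f(x):
    # Hypothetical smooth "true" function, chosen for illustration only.
    return math.sin(x)

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of the k-NN fit on the evaluation points."""
    total = sum((knn_predict(x_train, y_train, x, k) - y) ** 2
                for x, y in zip(x_eval, y_eval))
    return total / len(x_eval)

rng = random.Random(2)

def simulate(n):
    x = [rng.uniform(0, 6) for _ in range(n)]
    y = [true_f(xi) + rng.gauss(0, 0.5) for xi in x]
    return x, y

x_train, y_train = simulate(50)
x_test, y_test = simulate(500)   # new observations not used for fitting

for k in (1, 9):
    train_mse = mse(x_train, y_train, x_train, y_train, k)
    test_mse = mse(x_train, y_train, x_test, y_test, k)
    print(f"k={k}: training MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

A training error of zero says nothing about accuracy on new observations. In runs of this kind the rougher k = 1 fit typically does worse on the test points than the smoother k = 9 fit, because it follows the noise in the training data.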

As we have seen, there are advantages and disadvantages to parametric 
and non-parametric methods for statistical learning. We explore both types 
of methods throughout this book. 


2.1.3 The Trade-Off Between Prediction Accuracy and Model 
Interpretability 

Of the many methods that we examine in this book, some are less flexible,
or more restrictive, in the sense that they can produce just a relatively
small range of shapes to estimate f. For example, linear regression is a
relatively inflexible approach, because it can only generate linear functions
such as the lines shown in Figure 2.1 or the plane shown in Figure 2.4.
Other methods, such as the thin plate splines shown in Figures 2.5 and 2.6,
are considerably more flexible because they can generate a much wider
range of possible shapes to estimate f.

FIGURE 2.7. A representation of the tradeoff between flexibility and interpretability,
using different statistical learning methods. In general, as the flexibility
of a method increases, its interpretability decreases.

One might reasonably ask the following question: why would we ever 
choose to use a more restrictive method instead of a very flexible approach? 
There are several reasons that we might prefer a more restrictive model. 
If we are mainly interested in inference, then restrictive models are much 
more interpretable. For instance, when inference is the goal, the linear 
model may be a good choice since it will be quite easy to understand 
the relationship between Y and $X_1, X_2, \ldots, X_p$. In contrast, very flexible
approaches, such as the splines discussed in Chapter 7 and displayed in
Figures 2.5 and 2.6, and the boosting methods discussed in Chapter 8, can
lead to such complicated estimates of f that it is difficult to understand
how any individual predictor is associated with the response. 

Figure 2.7 provides an illustration of the trade-off between flexibility and 
interpretability for some of the methods that we cover in this book. Least 
squares linear regression, discussed in Chapter 3, is relatively inflexible but 
is quite interpretable. The lasso , discussed in Chapter 6, relies upon the 
linear model (2.4) but uses an alternative fitting procedure for estimating 
the coefficients $\beta_0, \beta_1, \ldots, \beta_p$. The new procedure is more restrictive in estimating
the coefficients, and sets a number of them to exactly zero. Hence
in this sense the lasso is a less flexible approach than linear regression. 
It is also more interpretable than linear regression, because in the final 
model the response variable will only be related to a small subset of the 
predictors, namely those with nonzero coefficient estimates. Generalized
additive models (GAMs), discussed in Chapter 7, instead extend the linear
model (2.4) to allow for certain non-linear relationships. Consequently,
GAMs are more flexible than linear regression. They are also somewhat 
less interpretable than linear regression, because the relationship between 
each predictor and the response is now modeled using a curve. Finally, fully 
non-linear methods such as bagging , boosting , and support vector machines 
with non-linear kernels, discussed in Chapters 8 and 9, are highly flexible 
approaches that are harder to interpret. 

We have established that when inference is the goal, there are clear advantages
to using simple and relatively inflexible statistical learning methods.
In some settings, however, we are only interested in prediction, and
the interpretability of the predictive model is simply not of interest. For
instance, if we seek to develop an algorithm to predict the price of a
stock, our sole requirement for the algorithm is that it predict accurately;
interpretability is not a concern. In this setting, we might expect that it
will be best to use the most flexible model available. Surprisingly, this is 
not always the case! We will often obtain more accurate predictions using 
a less flexible method. This phenomenon, which may seem counterintuitive 
at first glance, has to do with the potential for overfitting in highly flexible 
methods. We saw an example of overfitting in Figure 2.6. We will discuss 
this very important concept further in Section 2.2 and throughout this 
book. 


2.1.4 Supervised Versus Unsupervised Learning 

Most statistical learning problems fall into one of two categories: supervised
or unsupervised. The examples that we have discussed so far in this chapter
all fall into the supervised learning domain. For each observation of the
predictor measurement(s) $x_i$, $i = 1, \ldots, n$, there is an associated response
measurement $y_i$. We wish to fit a model that relates the response to the
predictors, with the aim of accurately predicting the response for future
observations (prediction) or better understanding the relationship between
the response and the predictors (inference). Many classical statistical learning
methods such as linear regression and logistic regression (Chapter 4), as
well as more modern approaches such as GAM, boosting, and support vector
machines, operate in the supervised learning domain. The vast majority
of this book is devoted to this setting.

In contrast, unsupervised learning describes the somewhat more challenging
situation in which for every observation $i = 1, \ldots, n$, we observe
a vector of measurements $x_i$ but no associated response $y_i$. It is not possible
to fit a linear regression model, since there is no response variable
to predict. In this setting, we are in some sense working blind; the situation
is referred to as unsupervised because we lack a response variable
that can supervise our analysis. What sort of statistical analysis is
possible? We can seek to understand the relationships between the variables
or between the observations. One statistical learning tool that we may use
in this setting is cluster analysis, or clustering. The goal of cluster analysis
is to ascertain, on the basis of $x_1, \ldots, x_n$, whether the observations fall into
relatively distinct groups. For example, in a market segmentation study we
might observe multiple characteristics (variables) for potential customers,
such as zip code, family income, and shopping habits. We might believe
that the customers fall into different groups, such as big spenders versus
low spenders. If the information about each customer’s spending patterns
were available, then a supervised analysis would be possible. However, this
information is not available; that is, we do not know whether each potential
customer is a big spender or not. In this setting, we can try to cluster
the customers on the basis of the variables measured, in order to identify
distinct groups of potential customers. Identifying such groups can be of
interest because it might be that the groups differ with respect to some
property of interest, such as spending habits.

FIGURE 2.8. A clustering data set involving three groups. Each group is shown
using a different colored symbol. Left: The three groups are well-separated. In
this setting, a clustering approach should successfully identify the three groups.
Right: There is some overlap among the groups. Now the clustering task is more
challenging.

Figure 2.8 provides a simple illustration of the clustering problem. We
have plotted 150 observations with measurements on two variables, $X_1$
and $X_2$. Each observation corresponds to one of three distinct groups. For
illustrative purposes, we have plotted the members of each group using
different colors and symbols. However, in practice the group memberships
are unknown, and the goal is to determine the group to which each observation
belongs. In the left-hand panel of Figure 2.8, this is a relatively
easy task because the groups are well-separated. In contrast, the right-hand
panel illustrates a more challenging problem in which there is some overlap
between the groups. A clustering method could not be expected to assign
all of the overlapping points to their correct group (blue, green, or orange).

In the examples shown in Figure 2.8, there are only two variables, and 
so one can simply visually inspect the scatterplots of the observations in 
order to identify clusters. However, in practice, we often encounter data 
sets that contain many more than two variables. In this case, we cannot 
easily plot the observations. For instance, if there are p variables in our 
data set, then $p(p - 1)/2$ distinct scatterplots can be made, and visual
inspection is simply not a viable way to identify clusters. For this reason, 
automated clustering methods are important. We discuss clustering and 
other unsupervised learning approaches in Chapter 10. 
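
As a rough illustration of what an automated clustering method does, the sketch below runs a basic K-means procedure (one of the methods named for Chapter 10) on simulated data resembling the well-separated case: 150 observations in three groups. It is a minimal sketch, not the book's implementation. The group centers, the spread, and the farthest-point starting rule in `farthest_point_init` are hypothetical choices made so that the example is deterministic.

```python
import math
import random

def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (a simple deterministic rule)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def k_means(points, k, n_iter=20):
    """Return a cluster label (0..k-1) for each point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Simulate three well-separated groups of 50 observations on two variables.
rng = random.Random(3)
group_means = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]   # hypothetical centers
true_groups = [g for g in range(3) for _ in range(50)]
points = [(group_means[g][0] + rng.uniform(-1, 1),
           group_means[g][1] + rng.uniform(-1, 1)) for g in true_groups]

# The algorithm sees only the measurements, never the true group labels.
labels = k_means(points, k=3)
print(sorted(set(labels)))
```

The label numbers returned are arbitrary; what matters is which observations share a label. With overlapping groups, as in the right-hand panel of Figure 2.8, the same procedure would be expected to misassign some points.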

Many problems fall naturally into the supervised or unsupervised learning
paradigms. However, sometimes the question of whether an analysis
should be considered supervised or unsupervised is less clear-cut. For instance,
suppose that we have a set of n observations. For m of the observations,
where m < n, we have both predictor measurements and a response
measurement. For the remaining $n - m$ observations, we have predictor
measurements but no response measurement. Such a scenario can arise if
the predictors can be measured relatively cheaply but the corresponding
responses are much more expensive to collect. We refer to this setting as
a semi-supervised learning problem. In this setting, we wish to use a statistical
learning method that can incorporate the m observations for which
response measurements are available as well as the $n - m$ observations for
which they are not. Although this is an interesting topic, it is beyond the
scope of this book.


2.1.5 Regression Versus Classification Problems 

Variables can be characterized as either quantitative or qualitative (also
known as categorical). Quantitative variables take on numerical values.
Examples include a person’s age, height, or income, the value of a house,
and the price of a stock. In contrast, qualitative variables take on
values in one of K different classes, or categories. Examples of qualitative
variables include a person’s gender (male or female), the brand of
product purchased (brand A, B, or C), whether a person defaults on a debt
(yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute
Lymphoblastic Leukemia, or No Leukemia). We tend to refer to problems
with a quantitative response as regression problems, while those
involving a qualitative response are often referred to as classification problems.
However, the distinction is not always that crisp. Least squares linear
regression (Chapter 3) is used with a quantitative response, whereas logistic
regression (Chapter 4) is typically used with a qualitative (two-class, or
binary) response. As such it is often used as a classification method. But
since it estimates class probabilities, it can be thought of as a regression
method as well. Some statistical methods, such as K-nearest neighbors
(Chapters 2 and 4) and boosting (Chapter 8), can be used in the case of
either quantitative or qualitative responses.

We tend to select statistical learning methods on the basis of whether
the response is quantitative or qualitative; for example, we might use linear
regression when the response is quantitative and logistic regression when it
is qualitative. However, whether the predictors are qualitative or
quantitative is generally considered less important. Most of the statistical
learning methods discussed in this book can be applied regardless of the
predictor variable type, provided that any qualitative predictors are
properly coded before the analysis is performed. This is discussed in Chapter 3.


2.2 Assessing Model Accuracy 

One of the key aims of this book is to introduce the reader to a wide range 
of statistical learning methods that extend far beyond the standard linear 
regression approach. Why is it necessary to introduce so many different 
statistical learning approaches, rather than just a single best method? There 
is no free lunch in statistics: no one method dominates all others over all 
possible data sets. On a particular data set, one specific method may work 
best, but some other method may work better on a similar but different 
data set. Hence it is an important task to decide for any given set of data 
which method produces the best results. Selecting the best approach can 
be one of the most challenging parts of performing statistical learning in 
practice. 

In this section, we discuss some of the most important concepts that 
arise in selecting a statistical learning procedure for a specific data set. As 
the book progresses, we will explain how the concepts presented here can 
be applied in practice. 


2.2.1 Measuring the Quality of Fit 

In order to evaluate the performance of a statistical learning method on 
a given data set, we need some way to measure how well its predictions 
actually match the observed data. That is, we need to quantify the extent 
to which the predicted response value for a given observation is close to 
the true response value for that observation. In the regression setting, the 
most commonly-used measure is the mean squared error (MSE), given by 

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2, \tag{2.5}$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the ith
observation. The MSE
will be small if the predicted responses are very close to the true responses,
and will be large if for some of the observations, the predicted and true
responses differ substantially.
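
As a minimal sketch of how the MSE is computed in practice, the R code
below fits a least squares line to simulated data and averages the squared
differences between observed and predicted responses. The data-generating
model (intercept 2, slope 0.5, unit noise) is an assumption made purely for
illustration.

```r
# Training MSE for a least squares fit on simulated data.
# The simulation settings below are hypothetical.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

fit   <- lm(y ~ x)       # estimate of f obtained by least squares
y_hat <- predict(fit)    # predictions for the n training observations

mse <- mean((y - y_hat)^2)   # (1/n) * sum of squared differences
mse
```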

The MSE in (2.5) is computed using the training data that was used to 
fit the model, and so should more accurately be referred to as the training 
MSE. But in general, we do not really care how well the method works 
on the training data. Rather, we are interested in the accuracy of the
predictions that we obtain when we apply our method to previously unseen
test data. Why is this what we care about? Suppose that we are interested 
in developing an algorithm to predict a stock’s price based on previous 
stock returns. We can train the method using stock returns from the past 
6 months. But we don’t really care how well our method predicts last week’s 
stock price. We instead care about how well it will predict tomorrow’s price 
or next month’s price. On a similar note, suppose that we have clinical 
measurements (e.g. weight, blood pressure, height, age, family history of 
disease) for a number of patients, as well as information about whether each 
patient has diabetes. We can use these patients to train a statistical
learning method to predict risk of diabetes based on clinical measurements. In
practice, we want this method to accurately predict diabetes risk for future 
patients based on their clinical measurements. We are not very interested 
in whether or not the method accurately predicts diabetes risk for patients 
used to train the model, since we already know which of those patients 
have diabetes. 

To state it more mathematically, suppose that we fit our statistical
learning method on our training observations
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$,
and we obtain the estimate $\hat{f}$. We can then compute
$\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n)$.
If these are approximately equal to $y_1, y_2, \ldots, y_n$, then the training MSE
given by (2.5) is small. However, we are really not interested in whether
$\hat{f}(x_i) \approx y_i$; instead, we want to know whether $\hat{f}(x_0)$ is
approximately equal to $y_0$, where $(x_0, y_0)$ is a previously unseen test
observation not used to train
the statistical learning method. We want to choose the method that gives
the lowest test MSE, as opposed to the lowest training MSE. In other words,
if we had a large number of test observations, we could compute

$$\mathrm{Ave}\bigl(y_0 - \hat{f}(x_0)\bigr)^2, \tag{2.6}$$

the average squared prediction error for these test observations $(x_0, y_0)$.
We’d like to select the model for which the average of this quantity (the
test MSE) is as small as possible.
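
The distinction between the training MSE and the test MSE can be sketched
in R as follows. Everything about the simulation is an assumption made for
illustration: the true f is taken to be sin(x), the noise standard deviation
is 0.3, and a cubic polynomial fitted with lm() stands in for the
statistical learning method.

```r
# Training MSE versus test MSE on simulated data (hypothetical setup).
set.seed(2)
f <- function(x) sin(x)                  # assumed true f
sim <- function(n) {
  x <- runif(n, 0, 6)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}
train <- sim(50)      # observations used to fit the model
test  <- sim(5000)    # large set of observations never used for fitting

fit <- lm(y ~ poly(x, 3), data = train)

# Average squared prediction error on each set
train_mse <- mean((train$y - predict(fit, newdata = train))^2)
test_mse  <- mean((test$y  - predict(fit, newdata = test))^2)
c(train = train_mse, test = test_mse)
```

Only the second quantity estimates how the fitted model would perform on
new observations.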

How can we go about trying to select a method that minimizes the test
MSE? In some settings, we may have a test data set available; that is,
we may have access to a set of observations that were not used to train
the statistical learning method. We can then simply evaluate (2.6) on the
test observations, and select the learning method for which the test MSE is
smallest. But what if no test observations are available? In that case, one
might imagine simply selecting a statistical learning method that minimizes
the training MSE (2.5). This seems like it might be a sensible approach,
since the training MSE and the test MSE appear to be closely related.
Unfortunately, there is a fundamental problem with this strategy: there
is no guarantee that the method with the lowest training MSE will also
have the lowest test MSE. Roughly speaking, the problem is that many
statistical methods specifically estimate coefficients so as to minimize the
training set MSE. For these methods, the training set MSE can be quite
small, but the test MSE is often much larger.

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of
f are shown: the linear regression line (orange curve), and two smoothing spline
fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red
curve), and minimum possible test MSE over all methods (dashed line). Squares
represent the training and test MSEs for the three fits shown in the left-hand
panel. (Horizontal axes: X in the left-hand panel, Flexibility in the
right-hand panel.)

Figure 2.9 illustrates this phenomenon on a simple example. In the
left-hand panel of Figure 2.9, we have generated observations from (2.1) with
the true f given by the black curve. The orange, blue and green curves
illustrate three possible estimates for f obtained using methods with increasing
levels of flexibility. The orange line is the linear regression fit, which is
relatively inflexible. The blue and green curves were produced using smoothing
splines, discussed in Chapter 7, with different levels of smoothness. It is
clear that as the level of flexibility increases, the curves fit the observed
data more closely. The green curve is the most flexible and matches the
data very well; however, we observe that it fits the true f (shown in black)
poorly because it is too wiggly. By adjusting the level of flexibility of the
smoothing spline fit, we can produce many different fits to this data.

We now move on to the right-hand panel of Figure 2.9. The grey curve
displays the average training MSE as a function of flexibility, or more
formally the degrees of freedom, for a number of smoothing splines. The
degrees of freedom is a quantity that summarizes the flexibility of a curve; it
is discussed more fully in Chapter 7. The orange, blue and green squares
indicate the MSEs associated with the corresponding curves in the
left-hand panel. A more restricted and hence smoother curve has fewer degrees
of freedom than a wiggly curve; note that in Figure 2.9, linear regression
is at the most restrictive end, with two degrees of freedom. The training
MSE declines monotonically as flexibility increases. In this example the
true f is non-linear, and so the orange linear fit is not flexible enough to
estimate f well. The green curve has the lowest training MSE of all three
methods, since it corresponds to the most flexible of the three curves fit in
the left-hand panel.

In this example, we know the true function f, and so we can also
compute the test MSE over a very large test set, as a function of flexibility. (Of
course, in general f is unknown, so this will not be possible.) The test MSE
is displayed using the red curve in the right-hand panel of Figure 2.9. As
with the training MSE, the test MSE initially declines as the level of
flexibility increases. However, at some point the test MSE levels off and then
starts to increase again. Consequently, the orange and green curves both
have high test MSE. The blue curve minimizes the test MSE, which should
not be surprising given that visually it appears to estimate f the best in the
left-hand panel of Figure 2.9. The horizontal dashed line indicates
$\mathrm{Var}(\epsilon)$,
the irreducible error in (2.3), which corresponds to the lowest achievable
test MSE among all possible methods. Hence, the smoothing spline
represented by the blue curve is close to optimal.

In the right-hand panel of Figure 2.9, as the flexibility of the statistical 
learning method increases, we observe a monotone decrease in the training 
MSE and a U-shape in the test MSE. This is a fundamental property of 
statistical learning that holds regardless of the particular data set at hand 
and regardless of the statistical method being used. As model flexibility 
increases, training MSE will decrease, but the test MSE may not. When 
a given method yields a small training MSE but a large test MSE, we are 
said to be overfitting the data. This happens because our statistical learning 
procedure is working too hard to find patterns in the training data, and 
may be picking up some patterns that are just caused by random chance 
rather than by true properties of the unknown function f. When we overfit
the training data, the test MSE will be very large because the supposed 
patterns that the method found in the training data simply don’t exist 
in the test data. Note that regardless of whether or not overfitting has 
occurred, we almost always expect the training MSE to be smaller than 
the test MSE because most statistical learning methods either directly or 
indirectly seek to minimize the training MSE. Overfitting refers specifically 
to the case in which a less flexible model would have yielded a smaller
test MSE.

FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is
much closer to linear. In this setting, linear regression provides a very good fit to
the data.


Figure 2.10 provides another example in which the true f is
approximately linear. Again we observe that the training MSE decreases
monotonically as the model flexibility increases, and that there is a U-shape in
the test MSE. However, because the truth is close to linear, the test MSE
only decreases slightly before increasing again, so that the orange least
squares fit is substantially better than the highly flexible green curve.
Finally, Figure 2.11 displays an example in which f is highly non-linear. The
training and test MSE curves still exhibit the same general patterns, but
now there is a rapid decrease in both curves before the test MSE starts to
increase slowly.

In practice, one can usually compute the training MSE with relative 
ease, but estimating test MSE is considerably more difficult because usually 
no test data are available. As the previous three examples illustrate, the 
flexibility level corresponding to the model with the minimal test MSE can 
vary considerably among data sets. Throughout this book, we discuss a 
variety of approaches that can be used in practice to estimate this minimum 
point. One important method is cross-validation (Chapter 5), which is a 
method for estimating test MSE using the training data. 
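
The behavior of the training and test MSE as flexibility increases can be
explored with a short simulation. The sketch below is not the smoothing
spline setup used for Figures 2.9-2.11: as a stand-in, flexibility is
controlled by the degree of a polynomial fitted with lm(), and the true f
(sin), the noise level, and the sample sizes are all assumptions.

```r
# Training and test MSE as a function of flexibility (polynomial degree).
# Hypothetical simulation; polynomial regression replaces smoothing splines.
set.seed(3)
f <- function(x) sin(x)
sim <- function(n) {
  x <- runif(n, 0, 6)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}
train <- sim(50)
test  <- sim(5000)

# Average squared prediction error of a fitted model on a data set
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

degrees <- 1:10
res <- t(sapply(degrees, function(k) {
  fit <- lm(y ~ poly(x, k), data = train)
  c(degree = k, train = mse(fit, train), test = mse(fit, test))
}))
res   # one row per degree: training MSE and test MSE
```

Because the polynomial models are nested, the training MSE column cannot
increase with the degree, whereas the test MSE column is free to rise once
the fit becomes too flexible.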


2.2.2 The Bias-Variance Trade-Off 

The U-shape observed in the test MSE curves (Figures 2.9-2.11) turns out
to be the result of two competing properties of statistical learning methods.
Though the mathematical proof is beyond the scope of this book, it is
possible to show that the expected test MSE, for a given value $x_0$, can
always be decomposed into the sum of three fundamental quantities: the
variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$ and the variance
of the error terms $\epsilon$. That is,

$$E\bigl(y_0 - \hat{f}(x_0)\bigr)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon). \tag{2.7}$$

Here the notation $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$ defines the expected test
MSE, and refers
to the average test MSE that we would obtain if we repeatedly estimated
f using a large number of training sets, and tested each at $x_0$. The overall
expected test MSE can be computed by averaging
$E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$ over all
possible values of $x_0$ in the test set.

FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from
linear. In this setting, linear regression provides a very poor fit to the data.

Equation 2.7 tells us that in order to minimize the expected test error, 
we need to select a statistical learning method that simultaneously achieves 
low variance and low bias. Note that variance is inherently a nonnegative 
quantity, and squared bias is also nonnegative. Hence, we see that the 
expected test MSE can never lie below $\mathrm{Var}(\epsilon)$, the irreducible error from
(2.3). 
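
The decomposition in Equation 2.7 can be checked numerically by simulation.
The sketch below repeatedly draws training sets, refits a polynomial model,
and records the prediction at a fixed test point. All settings are
assumptions made for illustration: the true f is sin(x), the error standard
deviation is 0.3, the test point is 2, and polynomial degree stands in for
flexibility.

```r
# Monte Carlo check of the bias-variance decomposition at one test point.
# All simulation settings are hypothetical.
set.seed(4)
f     <- function(x) sin(x)   # assumed true f
sigma <- 0.3                  # sd of the error term, so Var(eps) = sigma^2
x0    <- 2                    # test point
n     <- 50                   # training set size
reps  <- 2000                 # number of simulated training sets

# Prediction at x0 from a degree-d polynomial fit to a fresh training set
fit_at_x0 <- function(d) {
  x <- runif(n, 0, 6)
  y <- f(x) + rnorm(n, sd = sigma)
  unname(predict(lm(y ~ poly(x, d)), newdata = data.frame(x = x0)))
}

decompose <- function(d) {
  fhat <- replicate(reps, fit_at_x0(d))
  y0   <- f(x0) + rnorm(reps, sd = sigma)   # independent test responses at x0
  c(degree   = d,
    variance = mean((fhat - mean(fhat))^2),
    bias_sq  = (mean(fhat) - f(x0))^2,
    var_eps  = sigma^2,
    test_mse = mean((y0 - fhat)^2))
}

out <- t(sapply(c(1, 3, 8), decompose))
out
```

Up to Monte Carlo error, the test_mse column should be close to the sum of
the variance, bias_sq, and var_eps columns in each row.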

What do we mean by the variance and bias of a statistical learning
method? Variance refers to the amount by which $\hat{f}$ would change if we
estimated it using a different training data set. Since the training data
are used to fit the statistical learning method, different training data sets
will result in a different $\hat{f}$. But ideally the estimate for f should not vary
too much between training sets. However, if a method has high variance
then small changes in the training data can result in large changes in
$\hat{f}$. In
general, more flexible statistical methods have higher variance. Consider the
green and orange curves in Figure 2.9. The flexible green curve is following
the observations very closely. It has high variance because changing any
one of these data points may cause the estimate $\hat{f}$ to change considerably.
In contrast, the orange least squares line is relatively inflexible and has low
variance, because moving any single observation will likely cause only a
small shift in the position of the line.

On the other hand, bias refers to the error that is introduced by
approximating a real-life problem, which may be extremely complicated, by a much
simpler model. For example, linear regression assumes that there is a linear
relationship between Y and $X_1, X_2, \ldots, X_p$. It is unlikely that any real-life
problem truly has such a simple linear relationship, and so performing
linear regression will undoubtedly result in some bias in the estimate of f. In
Figure 2.11, the true f is substantially non-linear, so no matter how many
training observations we are given, it will not be possible to produce an
accurate estimate using linear regression. In other words, linear regression
results in high bias in this example. However, in Figure 2.10 the true f is
very close to linear, and so given enough data, it should be possible for
linear regression to produce an accurate estimate. Generally, more flexible
methods result in less bias.

As a general rule, as we use more flexible methods, the variance will 
increase and the bias will decrease. The relative rate of change of these 
two quantities determines whether the test MSE increases or decreases. As 
we increase the flexibility of a class of methods, the bias tends to initially 
decrease faster than the variance increases. Consequently, the expected 
test MSE declines. However, at some point increasing flexibility has little 
impact on the bias but starts to significantly increase the variance. When 
this happens the test MSE increases. Note that we observed this pattern 
of decreasing test MSE followed by increasing test MSE in the right-hand 
panels of Figures 2.9-2.11. 

The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in
Figures 2.9-2.11. In each case the blue solid curve represents the squared
bias, for different levels of flexibility, while the orange curve corresponds to
the variance. The horizontal dashed line represents $\mathrm{Var}(\epsilon)$, the
irreducible
error. Finally, the red curve, corresponding to the test set MSE, is the sum
of these three quantities. In all three cases, the variance increases and the
bias decreases as the method’s flexibility increases. However, the flexibility
level corresponding to the optimal test MSE differs considerably among the
three data sets, because the squared bias and variance change at different
rates in each of the data sets. In the left-hand panel of Figure 2.12, the
bias initially decreases rapidly, resulting in an initial sharp decrease in the
expected test MSE. On the other hand, in the center panel of Figure 2.12
the true f is close to linear, so there is only a small decrease in bias as
flexibility increases, and the test MSE only declines slightly before increasing
rapidly as the variance increases. Finally, in the right-hand panel of
Figure 2.12, as flexibility increases, there is a dramatic decline in bias because
the true f is very non-linear. There is also very little increase in variance
as flexibility increases. Consequently, the test MSE declines substantially
before experiencing a small increase as model flexibility increases.

FIGURE 2.12. Squared bias (blue curve), variance (orange curve),
$\mathrm{Var}(\epsilon)$
(dashed line), and test MSE (red curve) for the three data sets in Figures 2.9-2.11.
The vertical dotted line indicates the flexibility level corresponding to the smallest
test MSE.

The relationship between bias, variance, and test set MSE given in
Equation 2.7 and displayed in Figure 2.12 is referred to as the bias-variance
trade-off. Good test set performance of a statistical learning method
requires low variance as well as low squared bias. This is referred to as a
trade-off because it is easy to obtain a method with extremely low bias but
high variance (for instance, by drawing a curve that passes through every
single training observation) or a method with very low variance but high
bias (by fitting a horizontal line to the data). The challenge lies in finding
a method for which both the variance and the squared bias are low. This
trade-off is one of the most important recurring themes in this book.

In a real-life situation in which f is unobserved, it is generally not
possible to explicitly compute the test MSE, bias, or variance for a statistical
learning method. Nevertheless, one should always keep the bias-variance
trade-off in mind. In this book we explore methods that are extremely
flexible and hence can essentially eliminate bias. However, this does not
guarantee that they will outperform a much simpler method such as linear
regression. To take an extreme example, suppose that the true f is linear.
In this situation linear regression will have no bias, making it very hard
for a more flexible method to compete. In contrast, if the true f is highly
non-linear and we have an ample number of training observations, then
we may do better using a highly flexible approach, as in Figure 2.11. In
Chapter 5 we discuss cross-validation, which is a way to estimate the test
MSE using the training data.



















2.2.3 The Classification Setting 

Thus far, our discussion of model accuracy has been focused on the
regression setting. But many of the concepts that we have encountered, such
as the bias-variance trade-off, transfer over to the classification setting
with only some modifications due to the fact that $y_i$ is no longer
numerical. Suppose that we seek to estimate f on the basis of training
observations $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where now $y_1, \ldots, y_n$
are qualitative. The
most common approach for quantifying the accuracy of our estimate $\hat{f}$ is
the training error rate, the proportion of mistakes that are made if we apply
our estimate $\hat{f}$ to the training observations:

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i). \tag{2.8}$$

Here $\hat{y}_i$ is the predicted class label for the ith observation using
$\hat{f}$. And
$I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if
$y_i \neq \hat{y}_i$ and zero if $y_i = \hat{y}_i$.
If $I(y_i \neq \hat{y}_i) = 0$ then the ith observation was classified correctly by our
classification method; otherwise it was misclassified. Hence Equation 2.8
computes the fraction of incorrect classifications.

Equation 2.8 is referred to as the training error rate because it is
computed based on the data that was used to train our classifier. As in the
regression setting, we are most interested in the error rates that result from
applying our classifier to test observations that were not used in training.
The test error rate associated with a set of test observations of the form
$(x_0, y_0)$ is given by

$$\mathrm{Ave}\bigl(I(y_0 \neq \hat{y}_0)\bigr), \tag{2.9}$$

where $\hat{y}_0$ is the predicted class label that results from applying the classifier
to the test observation with predictor $x_0$. A good classifier is one for which
the test error (2.9) is smallest.
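
In R, an error rate of this form is simply the mean of a logical vector
that flags the misclassified observations. The labels below are made up
for illustration.

```r
# Error rate as the fraction of misclassified observations.
# Observed and predicted labels are hypothetical.
y     <- c("yes", "no", "no",  "yes", "no", "yes", "no", "no")
y_hat <- c("yes", "no", "yes", "yes", "no", "no",  "no", "no")

misclassified <- y != y_hat        # indicator: TRUE where the labels differ
error_rate <- mean(misclassified)  # TRUE counts as 1, FALSE as 0
error_rate                         # 2 of the 8 labels differ, giving 0.25
```

The same expression gives the training error rate when y and y_hat come
from the training data, and the test error rate when they come from test
observations.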

The Bayes Classifier 

It is possible to show (though the proof is outside of the scope of this
book) that the test error rate given in (2.9) is minimized, on average, by a
very simple classifier that assigns each observation to the most likely class,
given its predictor values. In other words, we should simply assign a test
observation with predictor vector $x_0$ to the class j for which

$$\Pr(Y = j \mid X = x_0) \tag{2.10}$$

is largest. Note that (2.10) is a conditional probability: it is the probability
that Y = j, given the observed predictor vector $x_0$. This very simple
classifier is called the Bayes classifier. In a two-class problem where there are
only two possible response values, say class 1 or class 2, the Bayes classifier
corresponds to predicting class one if $\Pr(Y = 1 \mid X = x_0) > 0.5$, and class
two otherwise.

FIGURE 2.13. A simulated data set consisting of 100 observations in each of
two groups, indicated in blue and in orange. The purple dashed line represents
the Bayes decision boundary. The orange background grid indicates the region
in which a test observation will be assigned to the orange class, and the blue
background grid indicates the region in which a test observation will be assigned
to the blue class.

Figure 2.13 provides an example using a simulated data set in a
two-dimensional space consisting of predictors $X_1$ and $X_2$. The orange and
blue circles correspond to training observations that belong to two different
classes. For each value of $X_1$ and $X_2$, there is a different probability of the
response being orange or blue. Since this is simulated data, we know how
the data were generated and we can calculate the conditional probabilities
for each value of $X_1$ and $X_2$. The orange shaded region reflects the set of
points for which $\Pr(Y = \text{orange} \mid X)$ is greater than 50%, while the blue
shaded region indicates the set of points for which the probability is below
50%. The purple dashed line represents the points where the probability
is exactly 50%. This is called the Bayes decision boundary. The Bayes
classifier’s prediction is determined by the Bayes decision boundary; an
observation that falls on the orange side of the boundary will be assigned
to the orange class, and similarly an observation on the blue side of the
boundary will be assigned to the blue class.

The Bayes classifier produces the lowest possible test error rate, called
the Bayes error rate. Since the Bayes classifier will always choose the class
for which (2.10) is largest, the error rate at $X = x_0$ will be
$1 - \max_j \Pr(Y = j \mid X = x_0)$. In general, the overall Bayes error rate
is given by

$$1 - E\Bigl(\max_j \Pr(Y = j \mid X)\Bigr), \tag{2.11}$$

where the expectation averages the probability over all possible values of
X. For our simulated data, the Bayes error rate is 0.1304. It is greater than
zero, because the classes overlap in the true population so
$\max_j \Pr(Y = j \mid X = x_0) < 1$ for some values of $x_0$. The Bayes error
rate is analogous to
the irreducible error, discussed earlier.
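
When the data-generating distribution is known, the Bayes classifier and
the Bayes error rate can be computed directly. The sketch below does this
for a one-dimensional model that is an assumption made for illustration
(it is not the simulation behind Figure 2.13): two equally likely classes,
with X drawn from a normal distribution with mean -1 in class 1 and mean 1
in class 2, both with unit variance.

```r
# Bayes classifier and Bayes error rate for a known (hypothetical) model:
# two equally likely classes, X | class 1 ~ N(-1, 1), X | class 2 ~ N(1, 1).
set.seed(5)
n   <- 100000
cls <- sample(1:2, n, replace = TRUE)
x   <- rnorm(n, mean = ifelse(cls == 1, -1, 1), sd = 1)

# Pr(Y = 1 | X = x) from Bayes' theorem; the equal priors cancel
p1 <- dnorm(x, mean = -1) / (dnorm(x, mean = -1) + dnorm(x, mean = 1))

# Bayes classifier: pick the class with the larger conditional probability
bayes_pred <- ifelse(p1 > 0.5, 1, 2)

# Two Monte Carlo estimates of the Bayes error rate
bayes_error_formula   <- 1 - mean(pmax(p1, 1 - p1))  # 1 - E(max_j Pr(Y = j | X))
bayes_error_empirical <- mean(bayes_pred != cls)     # error rate of the classifier

# For this model the decision boundary is x = 0, so the exact rate is pnorm(-1)
c(formula = bayes_error_formula,
  empirical = bayes_error_empirical,
  exact = pnorm(-1))
```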


K-Nearest Neighbors 

In theory we would always like to predict qualitative responses using the
Bayes classifier. But for real data, we do not know the conditional
distribution of Y given X, and so computing the Bayes classifier is
impossible. Therefore, the Bayes classifier serves as an unattainable gold standard
against which to compare other methods. Many approaches attempt to
estimate the conditional distribution of Y given X, and then classify a
given observation to the class with highest estimated probability. One such
method is the K-nearest neighbors (KNN) classifier. Given a positive
integer K and a test observation $x_0$, the KNN classifier first identifies the
K points in the training data that are closest to $x_0$, represented by
$\mathcal{N}_0$.
It then estimates the conditional probability for class j as the fraction of
points in $\mathcal{N}_0$ whose response values equal j:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j). \tag{2.12}$$

Finally, KNN applies Bayes rule and classifies the test observation $x_0$ to
the class with the largest probability.
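
The KNN steps just described can be sketched in a few lines of base R.
This is a minimal illustration rather than a production implementation:
the function name knn_predict, the use of Euclidean distance, and the toy
data are all assumptions. Ties in the estimated probabilities are broken in
favor of the class that comes first alphabetically.

```r
# Minimal KNN classifier for a single test point (Euclidean distance).
# train_x: numeric matrix with one row per training observation
# train_y: vector of class labels; x0: test point; K: number of neighbors
knn_predict <- function(train_x, train_y, x0, K) {
  d     <- sqrt(rowSums(sweep(train_x, 2, x0)^2))  # distance of each row to x0
  nbrs  <- order(d)[seq_len(K)]                    # indices of the K closest points
  probs <- table(train_y[nbrs]) / K                # fraction of neighbors per class
  list(probs = probs, class = names(probs)[which.max(probs)])
}

# Hypothetical toy training data
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(5, 5), c(6, 5), c(5, 6))
train_y <- c("blue", "blue", "blue", "orange", "orange", "orange")

# The three points nearest to (2, 2) are all blue
knn_predict(train_x, train_y, x0 = c(2, 2), K = 3)$class
```

Classes with no members among the K neighbors do not appear in probs; their
estimated probability is zero. For real analyses, the knn() function in the
class package offers a ready-made implementation.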

Figure 2.14 provides an illustrative example of the KNN approach. In 
the left-hand panel, we have plotted a small training data set consisting of 
six blue and six orange observations. Our goal is to make a prediction for 
the point labeled by the black cross. Suppose that we choose K = 3. Then 
KNN will first identify the three observations that are closest to the cross. 
This neighborhood is shown as a circle. It consists of two blue points and 
one orange point, resulting in estimated probabilities of 2/3 for the blue 
class and 1/3 for the orange class. Hence KNN will predict that the black 
cross belongs to the blue class. In the right-hand panel of Figure 2.14 we 
have applied the KNN approach with K = 3 at all of the possible values for
$X_1$ and $X_2$, and have drawn in the corresponding KNN decision boundary.

Despite the fact that it is a very simple approach, KNN can often
produce classifiers that are surprisingly close to the optimal Bayes classifier.
Figure 2.15 displays the KNN decision boundary, using K = 10, when
applied to the larger simulated data set from Figure 2.13. Notice that even
though the true distribution is not known by the KNN classifier, the KNN 
decision boundary is very close to that of the Bayes classifier. The test error 
rate using KNN is 0.1363, which is close to the Bayes error rate of 0.1304. 


FIGURE 2.14. The KNN approach, using K = 3, is illustrated in a simple
situation with six blue observations and six orange observations. Left: a test
observation at which a predicted class label is desired is shown as a black cross. The
three closest points to the test observation are identified, and it is predicted that
the test observation belongs to the most commonly-occurring class, in this case
blue. Right: The KNN decision boundary for this example is shown in black. The
blue grid indicates the region in which a test observation will be assigned to the
blue class, and the orange grid indicates the region in which it will be assigned to
the orange class.


The choice of K has a drastic effect on the KNN classifier obtained.
Figure 2.16 displays two KNN fits to the simulated data from Figure 2.13,
using K = 1 and K = 100. When K = 1, the decision boundary is overly
flexible and finds patterns in the data that don’t correspond to the Bayes
decision boundary. This corresponds to a classifier that has low bias but
very high variance. As K grows, the method becomes less flexible and
produces a decision boundary that is close to linear. This corresponds to
a low-variance but high-bias classifier. On this simulated data set, neither
K = 1 nor K = 100 gives good predictions: they have test error rates of
0.1695 and 0.1925, respectively.

Just as in the regression setting, there is not a strong relationship be¬ 
tween the training error rate and the test error rate. With K = 1, the 
KNN training error rate is 0, but the test error rate may be quite high. In 
general, as we use more flexible classification methods, the training error 
rate will decline but the test error rate may not. In Figure 2.17, we have 
plotted the KNN test and training errors as a function of 1/K. As 1/K in¬ 
creases, the method becomes more flexible. As in the regression setting, the 
training error rate consistently declines as the flexibility increases. However, 
the test error exhibits a characteristic U-shape, declining at first (with a 
minimum at approximately K = 10) before increasing again when the 
method becomes excessively flexible and overfits. 
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KNN: K—10 



FIGURE 2.15. The black curve indicates the KNN decision boundary on the 
data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as 
a purple dashed line. The KNN and Bayes decision boundaries are very similar. 


KNN: K=1 


KNN: K=100 



FIGURE 2.16. A comparison of the KNN decision boundaries (solid black 
curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With 
K = 1, the decision boundary is overly flexible, while with K = 100 it is not 
sufficiently flexible. The Bayes decision boundary is shown as a purple dashed 
line. 

FIGURE 2.17. The KNN training error rate (blue, 200 observations) and test 
error rate (orange, 5,000 observations) on the data from Figure 2.13, as the 
level of flexibility (assessed using 1/K) increases, or equivalently as the number 
of neighbors K decreases. The black dashed line indicates the Bayes error rate. 
The jumpiness of the curves is due to the small size of the training data set. 

In both the regression and classification settings, choosing the correct 
level of flexibility is critical to the success of any statistical learning method. 
The bias-variance tradeoff, and the resulting U-shape in the test error, can 
make this a difficult task. In Chapter 5, we return to this topic and discuss 
various methods for estimating test error rates and thereby choosing the 
optimal level of flexibility for a given statistical learning method. 


2.3 Lab: Introduction to R 

In this lab, we will introduce some simple R commands. The best way to 
learn a new language is to try out the commands. R can be downloaded from 


http://cran.r-project.org/ 


2.3.1 Basic Commands 


R uses functions to perform operations. To run a function called funcname, 
we type funcname(input1, input2), where the inputs (or arguments) input1 
and input2 tell R how to run the function. A function can have any number 
of inputs. For example, to create a vector of numbers, we use the function 
c() (for concatenate). Any numbers inside the parentheses are joined together. 
The following command instructs R to join together the numbers 
1, 3, 2, and 5, and to save them as a vector named x. When we type x, it 
gives us back the vector. 

> x <- c(1,3,2,5) 

> x 

[1] 1 3 2 5 

Note that the > is not part of the command; rather, it is printed by R to 
indicate that it is ready for another command to be entered. We can also 
save things using = rather than <-: 

> x = c(1,6,2) 

> x 

[1] 1 6 2 

> y = c(1,4,3) 

Hitting the up arrow multiple times will display the previous commands, 
which can then be edited. This is useful since one often wishes to repeat 
a similar command. In addition, typing ?funcname will always cause R to 
open a new help file window with additional information about the function 
funcname. 

We can tell R to add two sets of numbers together. It will then add the 
first number from x to the first number from y, and so on. However, x and 
y should be the same length. We can check their length using the length() 
function. 

> length(x) 

[1] 3 

> length(y) 

[1] 3 

> x + y 

[1] 2 10 5 
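
The reason the lengths should match is that R does not stop when they differ; it reuses (recycles) the shorter vector. The sketch below illustrates this with a hypothetical third vector z, which is not part of the lab.

```r
x <- c(1, 6, 2)
y <- c(1, 4, 3)

# Element-wise addition of two vectors of equal length.
x + y                      # 2 10 5

# If the lengths differ, R recycles the shorter vector rather than
# stopping; a warning is issued only when the longer length is not a
# multiple of the shorter one. Checking lengths first avoids surprises.
z <- c(10, 20, 30, 40, 50, 60)
if (length(z) %% length(x) == 0) {
  recycled <- z + x        # x is reused twice: 11 26 32 41 56 62
}
```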

The ls() function allows us to look at a list of all of the objects, such 
as data and functions, that we have saved so far. The rm() function can be 
used to delete any that we don’t want. 

> ls() 

[1] "x" "y" 

> rm(x,y) 

> ls() 

character(0) 

It’s also possible to remove all objects at once: 


> rm(list=ls()) 

The matrix() function can be used to create a matrix of numbers. Before 
we use the matrix() function, we can learn more about it: 

> ?matrix 

The help file reveals that the matrix() function takes a number of inputs, 
but for now we focus on the first three: the data (the entries in the matrix), 
the number of rows, and the number of columns. First, we create a simple 
matrix. 

> x=matrix(data=c(1,2,3,4), nrow=2, ncol=2) 

> x 

     [,1] [,2] 
[1,]    1    3 
[2,]    2    4 

Note that we could just as well omit typing data=, nrow=, and ncol= in the 
matrix() command above: that is, we could just type 

> x=matrix(c(1,2,3,4),2,2) 

and this would have the same effect. However, it can sometimes be useful to 
specify the names of the arguments passed in, since otherwise R will assume 
that the function arguments are passed into the function in the same order 
that is given in the function’s help file. As this example illustrates, by 
default R creates matrices by successively filling in columns. Alternatively, 
the byrow=TRUE option can be used to populate the matrix in order of the 
rows. 

> matrix(c(1,2,3,4),2,2,byrow=TRUE) 
     [,1] [,2] 
[1,]    1    2 
[2,]    3    4 
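
The difference between the two fill orders, and the effect of naming arguments, can be checked directly. This is a small sketch; the object names by_col, by_row, and same are introduced here for illustration only.

```r
# Column-wise fill (the default) versus row-wise fill.
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Named arguments may be given in any order.
same <- matrix(ncol = 3, data = 1:6, nrow = 2)
identical(same, by_col)   # TRUE

# Filling by row is the same as filling the transposed shape by column
# and then transposing.
identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2)))   # TRUE
```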

Notice that in the above command we did not assign the matrix to a value 
such as x. In this case the matrix is printed to the screen but is not saved 
for future calculations. The sqrt() function returns the square root of each 
element of a vector or matrix. The command x^2 raises each element of x 
to the power 2; any powers are possible, including fractional or negative 
powers. 

> sqrt(x) 
     [,1] [,2] 
[1,] 1.00 1.73 
[2,] 1.41 2.00 

> x^2 
     [,1] [,2] 
[1,]    1    9 
[2,]    4   16 
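
Fractional and negative powers follow the same element-wise rule. The sketch below uses the same 2 x 2 matrix; the comparison with solve() is added here only to stress that a negative power is not a matrix inverse.

```r
x <- matrix(c(1, 2, 3, 4), 2, 2)

# A fractional power: x^(1/2) matches sqrt(x) element by element.
all.equal(x^(1/2), sqrt(x))   # TRUE

# A negative power: x^(-1) is the element-wise reciprocal, not the
# matrix inverse (solve(x) computes the inverse).
x^(-1)
solve(x)
```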

The rnorm() function generates a vector of random normal variables, 
with first argument n the sample size. Each time we call this function, we 
will get a different answer. Here we create two correlated sets of numbers, 
x and y, and use the cor() function to compute the correlation between 
them. 

> x=rnorm(50) 

> y=x+rnorm(50,mean=50,sd=.1) 

> cor(x,y) 

[1] 0.995 

By default, rnorm() creates standard normal random variables with a mean 
of 0 and a standard deviation of 1. However, the mean and standard deviation 
can be altered using the mean and sd arguments, as illustrated above. 
Sometimes we want our code to reproduce the exact same set of random 
numbers; we can use the set.seed() function to do this. The set.seed() 
function takes an (arbitrary) integer argument. 

> set.seed(1303) 

> rnorm(50) 

[1] -1.1440 1.3421 2.1854 0.5364 0.0632 0.5022 -0.0004 

We use set.seed() throughout the labs whenever we perform calculations 
involving random quantities. In general this should allow the user to reproduce 
our results. However, it should be noted that as new versions of 
R become available it is possible that some small discrepancies may form 
between the book and the output from R. 
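
A quick way to see what set.seed() does is to draw the same numbers twice. In this sketch the names first, second, and third are illustrative only.

```r
set.seed(1303)
first <- rnorm(5)

set.seed(1303)
second <- rnorm(5)

# Resetting the seed reproduces the draws exactly.
identical(first, second)   # TRUE

# Without resetting, the generator continues from its current state,
# so the next call returns different values.
third <- rnorm(5)
identical(first, third)    # FALSE
```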

The mean() and var() functions can be used to compute the mean and 
variance of a vector of numbers. Applying sqrt() to the output of var() 
will give the standard deviation. Or we can simply use the sd() function. 

> set.seed(3) 

> y=rnorm(100) 

> mean(y) 

[1] 0.0110 

> var(y) 

[1] 0.7329 

> sqrt(var(y)) 

[1] 0.8561 

> sd(y) 

[1] 0.8561 
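
As a check on what var() computes, the sample variance can also be written out by hand. This sketch uses the same seed as above; manual_var is a name introduced here for illustration.

```r
set.seed(3)
y <- rnorm(100)

# var() uses the divisor n - 1 (the sample variance).
n <- length(y)
manual_var <- sum((y - mean(y))^2) / (n - 1)

all.equal(manual_var, var(y))        # TRUE
all.equal(sqrt(manual_var), sd(y))   # TRUE
```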


2.3.2 Graphics 

The plot() function is the primary way to plot data in R. For instance, 
plot(x,y) produces a scatterplot of the numbers in x versus the numbers 
in y. There are many additional options that can be passed in to the plot() 
function. For example, passing in the argument xlab will result in a label 
on the x-axis. To find out more information about the plot() function, 
type ?plot. 

> x=rnorm(100) 

> y=rnorm(100) 

> plot(x,y) 

> plot(x,y,xlab="this is the x-axis",ylab="this is the y-axis", 
    main="Plot of X vs Y") 

We will often want to save the output of an R plot. The command that we 
use to do this will depend on the file type that we would like to create. For 
instance, to create a pdf, we use the pdf() function, and to create a jpeg, 
we use the jpeg() function. 

> pdf("Figure.pdf") 

> plot(x,y,col="green") 

> dev.off() 
null device 
          1 

The function dev.off() indicates to R that we are done creating the plot. 
Alternatively, we can simply copy the plot window and paste it into an 
appropriate file type, such as a Word document. 

The function seq() can be used to create a sequence of numbers. For 
instance, seq(a,b) makes a vector of integers between a and b. There are 
many other options: for instance, seq(0,1,length=10) makes a sequence of 
10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a 
shorthand for seq(3,11) for integer arguments. 

> x=seq(1,10) 

> x 

[1] 1 2 3 4 5 6 7 8 9 10 

> x=1:10 

> x 

[1] 1 2 3 4 5 6 7 8 9 10 

> x=seq(-pi,pi,length=50) 
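
The options of seq() mentioned above can be compared side by side. This is a brief sketch; the by= argument is a standard seq() option that the text does not discuss, and the names a and b are illustrative.

```r
# 3:11 and seq(3, 11) give the same integer vector.
identical(3:11, seq(3, 11))   # TRUE

# length= fixes the number of points; by= fixes the step instead.
a <- seq(0, 1, length = 10)   # 10 points, step 1/9
b <- seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00

length(a)   # 10
diff(a)[1]  # 0.1111111
```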

We will now create some more sophisticated plots. The contour() function 
produces a contour plot in order to represent three-dimensional data; 
it is like a topographical map. It takes three arguments: 

1. A vector of the x values (the first dimension), 

2. A vector of the y values (the second dimension), and 

3. A matrix whose elements correspond to the z value (the third dimension) 
for each pair of (x,y) coordinates. 

As with the plot() function, there are many other inputs that can be used 
to fine-tune the output of the contour() function. To learn more about 
these, take a look at the help file by typing ?contour. 

> y=x 

> f=outer(x,y,function(x,y)cos(y)/(1+x^2)) 

> contour(x,y,f) 

> contour(x,y,f,nlevels=45,add=T) 

> fa=(f-t(f))/2 

> contour(x,y,fa,nlevels=15) 
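
The matrices passed to contour() above can be inspected without plotting. The sketch below only verifies how outer() lays out the z values and what the (f-t(f))/2 step produces; the particular indices 3 and 7 are arbitrary choices for illustration.

```r
x <- seq(-pi, pi, length = 50)
y <- x

# outer() evaluates the function on every (x[i], y[j]) pair, so
# f[i, j] = cos(y[j]) / (1 + x[i]^2).
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
dim(f)                                             # 50 50
all.equal(f[3, 7], cos(y[7]) / (1 + x[3]^2))       # TRUE

# fa = (f - t(f)) / 2 is the antisymmetric part of f: transposing it
# flips the sign, and its diagonal is zero.
fa <- (f - t(f)) / 2
all.equal(t(fa), -fa)                              # TRUE
```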

The image() function works the same way as contour(), except that it 
produces a color-coded plot whose colors depend on the z value. This is 
known as a heatmap, and is sometimes used to plot temperature in weather 
forecasts. Alternatively, persp() can be used to produce a three-dimensional 
plot. The arguments theta and phi control the angles at which the plot is 
viewed. 

> image(x,y,fa) 

> persp(x,y,fa) 

> persp(x,y,fa,theta=30) 

> persp(x,y,fa,theta=30,phi=20) 

> persp(x,y,fa,theta=30,phi=70) 

> persp(x,y,fa,theta=30,phi=40) 


2.3.3 Indexing Data 

We often wish to examine part of a set of data. Suppose that our data is 
stored in the matrix A. 


> A=matrix(1:16,4,4) 

> A 
     [,1] [,2] [,3] [,4] 
[1,]    1    5    9   13 
[2,]    2    6   10   14 
[3,]    3    7   11   15 
[4,]    4    8   12   16 

Then, typing 

> A[2,3] 

[1] 10 

will select the element corresponding to the second row and the third column. 
The first number after the open-bracket symbol [ always refers to 
the row, and the second number always refers to the column. We can also 
select multiple rows and columns at a time, by providing vectors as the 
indices. 


> A[c(1,3),c(2,4)] 
     [,1] [,2] 
[1,]    5   13 
[2,]    7   15 

> A[1:3,2:4] 
     [,1] [,2] [,3] 
[1,]    5    9   13 
[2,]    6   10   14 
[3,]    7   11   15 

> A[1:2,] 
     [,1] [,2] [,3] [,4] 
[1,]    1    5    9   13 
[2,]    2    6   10   14 

> A[,1:2] 
     [,1] [,2] 
[1,]    1    5 
[2,]    2    6 
[3,]    3    7 
[4,]    4    8 

The last two examples include either no index for the columns or no index 
for the rows. These indicate that R should include all columns or all rows, 
respectively. R treats a single row or column of a matrix as a vector. 

> A[1,] 

[1] 1 5 9 13 

The use of a negative sign - in the index tells R to keep all rows or columns 
except those indicated in the index. 

> A[-c(1,3),] 
     [,1] [,2] [,3] [,4] 
[1,]    2    6   10   14 
[2,]    4    8   12   16 

> A[-c(1,3),-c(1,3,4)] 

[1] 6 8 

The dim() function outputs the number of rows followed by the number of 
columns of a given matrix. 

> dim(A) 

[1] 4 4 
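
Two of the indexing rules above are easy to verify. In this sketch, the drop=FALSE argument is a standard option of R's [ operator that the text does not cover; it is included as an assumption about what a reader may want when a one-row result should stay a matrix.

```r
A <- matrix(1:16, 4, 4)

# A single row is returned as a plain vector by default...
is.matrix(A[1, ])                  # FALSE

# ...but drop = FALSE keeps the result as a 1 x 4 matrix.
dim(A[1, , drop = FALSE])          # 1 4

# Negative indices exclude: dropping rows 1 and 3 is the same as
# keeping rows 2 and 4.
identical(A[-c(1, 3), ], A[c(2, 4), ])   # TRUE
```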


2.3.4 Loading Data 

For most analyses, the first step involves importing a data set into R. The 
read.table() function is one of the primary ways to do this. The help file 
contains details about how to use this function. We can use the function 
write.table() to export data. 

Before attempting to load a data set, we must make sure that R knows 
to search for the data in the proper directory. For example, on a Windows 
system one could select the directory using the Change dir... option under 
the File menu. However, the details of how to do this depend on the operating 
system (e.g. Windows, Mac, Unix) that is being used, and so we 
do not give further details here. We begin by loading in the Auto data set. 
This data is part of the ISLR library (we discuss libraries in Chapter 3) but 
to illustrate the read.table() function we load it now from a text file. The 
following command will load the Auto.data file into R and store it as an 
object called Auto, in a format referred to as a data frame. (The text file 
can be obtained from this book’s website.) Once the data has been loaded, 
the fix() function can be used to view it in a spreadsheet-like window. 
However, the window must be closed before further R commands can be 
entered. 

> Auto=read.table("Auto.data") 

> fix(Auto) 


Note that Auto.data is simply a text file, which you could alternatively 
open on your computer using a standard text editor. It is often a good idea 
to view a data set using a text editor or other software such as Excel before 
loading it into R. 

This particular data set has not been loaded correctly, because R has 
assumed that the variable names are part of the data and so has included 
them in the first row. The data set also includes a number of missing 
observations, indicated by a question mark ?. Missing values are a common 
occurrence in real data sets. Using the option header=T (or header=TRUE) in 
the read.table() function tells R that the first line of the file contains the 
variable names, and using the option na.strings tells R that any time it 
sees a particular character or set of characters (such as a question mark), 
it should be treated as a missing element of the data matrix. 

> Auto=read.table("Auto.data",header=T,na.strings="?") 

> fix(Auto) 

Excel is a common-format data storage program. An easy way to load such 
data into R is to save it as a csv (comma separated value) file and then use 
the read.csv() function to load it in. 

> Auto=read.csv("Auto.csv",header=T,na.strings="?") 

> fix(Auto) 

> dim(Auto) 

[1] 397 9 

> Auto[1:4,] 

The dim() function tells us that the data has 397 observations, or rows, and 
nine variables, or columns. There are various ways to deal with the missing 
data. In this case, only five of the rows contain missing observations, and 
so we choose to use the na.omit() function to simply remove these rows. 

> Auto=na.omit(Auto) 

> dim(Auto) 

[1] 392 9 
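
The effect of header, na.strings, and na.omit() can be reproduced on a few lines of text without any file. The following is a sketch under stated assumptions: the three data rows are made-up stand-ins for Auto.data, and the text= argument of read.table() is used here only so that the example is self-contained.

```r
# A tiny stand-in for Auto.data (hypothetical values), read from an
# in-memory string instead of a file.
raw <- "mpg cylinders horsepower
18 8 130
25 4 ?
22 6 95"

# Without header/na.strings, the names become the first data row and
# every column is read as text.
bad <- read.table(text = raw)
dim(bad)                 # 4 3

# With both options, names are recognised and '?' becomes NA.
good <- read.table(text = raw, header = TRUE, na.strings = "?")
dim(good)                # 3 3
sum(is.na(good))         # 1

# na.omit() drops every row that contains at least one NA.
clean <- na.omit(good)
dim(clean)               # 2 3
```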

Once the data are loaded correctly, we can use names() to check the 
variable names. 

> names(Auto) 

[1] "mpg" "cylinders" "displacement" "horsepower" 

[5] "weight" "acceleration" "year" "origin" 

[9] "name" 


2.3.5 Additional Graphical and Numerical Summaries 

We can use the plot() function to produce scatterplots of the quantitative 
variables. However, simply typing the variable names will produce an error 
message, because R does not know to look in the Auto data set for those 
variables. 

> plot(cylinders, mpg) 

Error in plot(cylinders, mpg) : object 'cylinders' not found 

To refer to a variable, we must type the data set and the variable name 
joined with a $ symbol. Alternatively, we can use the attach() function in 
order to tell R to make the variables in this data frame available by name. 

> plot(Auto$cylinders, Auto$mpg) 

> attach(Auto) 

> plot(cylinders, mpg) 

The cylinders variable is stored as a numeric vector, so R has treated it 
as quantitative. However, since there are only a small number of possible 
values for cylinders, one may prefer to treat it as a qualitative variable. 
The as.factor() function converts quantitative variables into qualitative 
variables. 

> cylinders=as.factor(cylinders) 
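
The conversion can be tried on a small example. In this sketch, autos_small is a hypothetical miniature data frame (not the Auto data), and with() is shown as one alternative to attach(); neither appears in the lab itself.

```r
# Hypothetical miniature data frame standing in for Auto.
autos_small <- data.frame(cylinders = c(4, 8, 4, 6, 8, 4),
                          mpg = c(30, 15, 28, 21, 14, 33))

# $ extracts a column without attaching the data frame.
is.numeric(autos_small$cylinders)   # TRUE

# as.factor() turns the distinct numeric values into category labels.
cyl <- as.factor(autos_small$cylinders)
levels(cyl)                         # "4" "6" "8"
table(cyl)                          # counts per category: 3 1 2

# with() evaluates an expression inside the data frame, an alternative
# to attach() that leaves the search path unchanged.
with(autos_small, tapply(mpg, cylinders, mean))
```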

If the variable plotted on the x-axis is categorical, then boxplots will 
automatically be produced by the plot() function. As usual, a number 
of options can be specified in order to customize the plots. 

> plot(cylinders, mpg) 

> plot(cylinders, mpg, col="red") 

> plot(cylinders, mpg, col="red", varwidth=T) 

> plot(cylinders, mpg, col="red", varwidth=T, horizontal=T) 

> plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders", 
    ylab="MPG") 



The hist() function can be used to plot a histogram. Note that col=2 
has the same effect as col="red". 

> hist(mpg) 

> hist(mpg,col=2) 

> hist(mpg,col=2,breaks=15) 

The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every 
pair of variables for any given data set. We can also produce scatterplots 
for just a subset of the variables. 

> pairs(Auto) 

> pairs(~ mpg + displacement + horsepower + weight + 
    acceleration, Auto) 

In conjunction with the plot() function, identify() provides a useful 
interactive method for identifying the value for a particular variable for 
points on a plot. We pass in three arguments to identify(): the x-axis 
variable, the y-axis variable, and the variable whose values we would like 
to see printed for each point. Then clicking on a given point in the plot 
will cause R to print the value of the variable of interest. Right-clicking on 
the plot will exit the identify() function (control-click on a Mac). The 
numbers printed under the identify() function correspond to the rows for 
the selected points. 

> plot(horsepower,mpg) 

> identify(horsepower,mpg,name) 


The summary() function produces a numerical summary of each variable in 
a particular data set. 

> summary(Auto) 
      mpg          cylinders      displacement 
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0 
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0 
 Median :22.75   Median :4.000   Median :151.0 
 Mean   :23.45   Mean   :5.472   Mean   :194.4 
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8 
 Max.   :46.60   Max.   :8.000   Max.   :455.0 

   horsepower        weight      acceleration 
 Min.   : 46.0   Min.   :1613   Min.   : 8.00 
 1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78 
 Median : 93.5   Median :2804   Median :15.50 
 Mean   :104.5   Mean   :2978   Mean   :15.54 
 3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02 
 Max.   :230.0   Max.   :5140   Max.   :24.80 

      year           origin                name 
 Min.   :70.00   Min.   :1.000   amc matador 
 1st Qu.:73.00   1st Qu.:1.000   ford pinto 
 Median :76.00   Median :1.000   toyota corolla 
 Mean   :75.98   Mean   :1.577   amc gremlin 
 3rd Qu.:79.00   3rd Qu.:2.000   amc hornet 
 Max.   :82.00   Max.   :3.000   chevrolet chevette 
                                 (Other) 


For qualitative variables such as name, R will list the number of observations 
that fall in each category. We can also produce a summary of just a single 
variable. 


> summary(mpg) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   17.00   22.75   23.45   29.00   46.60 



Once we have finished using R, we type q() in order to shut it down, or 
quit. When exiting R, we have the option to save the current workspace so 
that all objects (such as data sets) that we have created in this R session 
will be available next time. Before exiting R, we may want to save a record 
of all of the commands that we typed in the most recent session; this can 
be accomplished using the savehistory() function. Next time we enter R, 
we can load that history using the loadhistory() function. 


2.4 Exercises 

Conceptual 

1. For each of parts (a) through (d), indicate whether we would generally 
expect the performance of a flexible statistical learning method to be 
better or worse than an inflexible method. Justify your answer. 

(a) The sample size n is extremely large, and the number of predictors 
p is small. 

(b) The number of predictors p is extremely large, and the number 
of observations n is small. 

(c) The relationship between the predictors and response is highly 
non-linear. 

(d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$, is extremely 
high. 

2. Explain whether each scenario is a classification or regression problem, 
and indicate whether we are most interested in inference or prediction. 
Finally, provide n and p. 

(a) We collect a set of data on the top 500 firms in the US. For each 
firm we record profit, number of employees, industry and the 
CEO salary. We are interested in understanding which factors 
affect CEO salary. 

(b) We are considering launching a new product and wish to know 
whether it will be a success or a failure. We collect data on 20 
similar products that were previously launched. For each product 
we have recorded whether it was a success or failure, price 
charged for the product, marketing budget, competition price, 
and ten other variables. 

(c) We are interested in predicting the % change in the US dollar in 
relation to the weekly changes in the world stock markets. Hence 
we collect weekly data for all of 2012. For each week we record 
the % change in the dollar, the % change in the US market, 
the % change in the British market, and the % change in the 
German market. 

3. We now revisit the bias-variance decomposition. 

(a) Provide a sketch of typical (squared) bias, variance, training error, 
test error, and Bayes (or irreducible) error curves, on a single 
plot, as we go from less flexible statistical learning methods 
towards more flexible approaches. The x-axis should represent 
the amount of flexibility in the method, and the y-axis should 
represent the values for each curve. There should be five curves. 
Make sure to label each one. 

(b) Explain why each of the five curves has the shape displayed in 
part (a). 

4. You will now think of some real-life applications for statistical learning. 

(a) Describe three real-life applications in which classification might 
be useful. Describe the response, as well as the predictors. Is the 
goal of each application inference or prediction? Explain your 
answer. 

(b) Describe three real-life applications in which regression might 
be useful. Describe the response, as well as the predictors. Is the 
goal of each application inference or prediction? Explain your 
answer. 

(c) Describe three real-life applications in which cluster analysis 
might be useful. 

5. What are the advantages and disadvantages of a very flexible (versus 
a less flexible) approach for regression or classification? Under what 
circumstances might a more flexible approach be preferred to a less 
flexible approach? When might a less flexible approach be preferred? 

6. Describe the differences between a parametric and a non-parametric 
statistical learning approach. What are the advantages of a parametric 
approach to regression or classification (as opposed to a non-parametric 
approach)? What are its disadvantages? 

7. The table below provides a training data set containing six observations, 
three predictors, and one qualitative response variable. 

Obs.   X1   X2   X3   Y 
1       0    3    0   Red 
2       2    0    0   Red 
3       0    1    3   Red 
4       0    1    2   Green 
5      -1    0    1   Green 
6       1    1    1   Red 

Suppose we wish to use this data set to make a prediction for Y when 
X1 = X2 = X3 = 0 using K-nearest neighbors. 

(a) Compute the Euclidean distance between each observation and 
the test point, X1 = X2 = X3 = 0. 

(b) What is our prediction with K = 1? Why? 

(c) What is our prediction with K = 3? Why? 

(d) If the Bayes decision boundary in this problem is highly non-linear, 
then would we expect the best value for K to be large or 
small? Why? 

Applied 

8. This exercise relates to the College data set, which can be found in 
the file College.csv. It contains a number of variables for 777 different 
universities and colleges in the US. The variables are 

• Private : Public/private indicator 

• Apps : Number of applications received 

• Accept : Number of applicants accepted 

• Enroll : Number of new students enrolled 

• Top10perc : New students from top 10% of high school class 

• Top25perc : New students from top 25% of high school class 

• F.Undergrad : Number of full-time undergraduates 

• P.Undergrad : Number of part-time undergraduates 

• Outstate : Out-of-state tuition 

• Room.Board : Room and board costs 

• Books : Estimated book costs 

• Personal : Estimated personal spending 

• PhD : Percent of faculty with Ph.D.’s 

• Terminal : Percent of faculty with terminal degree 

• S.F.Ratio : Student/faculty ratio 

• perc.alumni : Percent of alumni who donate 

• Expend : Instructional expenditure per student 

• Grad.Rate : Graduation rate 

Before reading the data into R, it can be viewed in Excel or a text 
editor. 

(a) Use the read.csv() function to read the data into R. Call the 
loaded data college. Make sure that you have the directory set 
to the correct location for the data. 

(b) Look at the data using the fix() function. You should notice 
that the first column is just the name of each university. We don’t 
really want R to treat this as data. However, it may be handy to 
have these names for later. Try the following commands: 

> rownames(college)=college[,1] 

> fix(college) 

You should see that there is now a row.names column with the 
name of each university recorded. This means that R has given 
each row a name corresponding to the appropriate university. R 
will not try to perform calculations on the row names. However, 
we still need to eliminate the first column in the data where the 
names are stored. Try 

> college=college[,-1] 

> fix(college) 

Now you should see that the first data column is Private. Note 
that another column labeled row.names now appears before the 
Private column. However, this is not a data column but rather 
the name that R is giving to each row. 

i. Use the summary() function to produce a numerical summary 
of the variables in the data set. 

ii. Use the pairs() function to produce a scatterplot matrix of 
the first ten columns or variables of the data. Recall that 
you can reference the first ten columns of a matrix A using 
A[,1:10]. 

iii. Use the plot() function to produce side-by-side boxplots of 
Outstate versus Private. 

iv. Create a new qualitative variable, called Elite, by binning 
the Top10perc variable. We are going to divide universities 
into two groups based on whether or not the proportion 
of students coming from the top 10% of their high school 
classes exceeds 50%. 

> Elite=rep("No",nrow(college)) 

> Elite[college$Top10perc>50]="Yes" 

> Elite=as.factor(Elite) 

> college=data.frame(college,Elite) 

Use the summary() function to see how many elite universities 
there are. Now use the plot() function to produce 
side-by-side boxplots of Outstate versus Elite. 

v. Use the hist() function to produce some histograms with 
differing numbers of bins for a few of the quantitative variables. 
You may find the command par(mfrow=c(2,2)) useful: 
it will divide the print window into four regions so that four 
plots can be made simultaneously. Modifying the arguments 
to this function will divide the screen in other ways. 

vi. Continue exploring the data, and provide a brief summary 
of what you discover. 

9. This exercise involves the Auto data set studied in the lab. Make sure 
that the missing values have been removed from the data. 

(a) Which of the predictors are quantitative, and which are qualitative? 

(b) What is the range of each quantitative predictor? You can answer 
this using the range() function. 

(c) What is the mean and standard deviation of each quantitative 
predictor? 

(d) Now remove the 10th through 85th observations. What is the 
range, mean, and standard deviation of each predictor in the 
subset of the data that remains? 

(e) Using the full data set, investigate the predictors graphically, 
using scatterplots or other tools of your choice. Create some plots 
highlighting the relationships among the predictors. Comment 
on your findings. 

(f) Suppose that we wish to predict gas mileage (mpg) on the basis 
of the other variables. Do your plots suggest that any of the 
other variables might be useful in predicting mpg? Justify your 
answer. 

10. This exercise involves the Boston housing data set. 

(a) To begin, load in the Boston data set. The Boston data set is 
part of the MASS library in R. 

> library(MASS) 

Now the data set is contained in the object Boston. 

> Boston 

Read about the data set: 

> ?Boston 

How many rows are in this data set? How many columns? What 
do the rows and columns represent? 

(b) Make some pairwise scatterplots of the predictors (columns) in 
this data set. Describe your findings. 

(c) Are any of the predictors associated with per capita crime rate? 
If so, explain the relationship. 

(d) Do any of the suburbs of Boston appear to have particularly 
high crime rates? Tax rates? Pupil-teacher ratios? Comment on 
the range of each predictor. 

(e) How many of the suburbs in this data set bound the Charles 
river? 

(f) What is the median pupil-teacher ratio among the towns in this 
data set? 

(g) Which suburb of Boston has the lowest median value of owner-occupied 
homes? What are the values of the other predictors 
for that suburb, and how do those values compare to the overall 
ranges for those predictors? Comment on your findings. 

(h) In this data set, how many of the suburbs average more than 
seven rooms per dwelling? More than eight rooms per dwelling? 
Comment on the suburbs that average more than eight rooms 
per dwelling. 


3 

Linear Regression 


This chapter is about linear regression, a very simple approach for 
supervised learning. In particular, linear regression is a useful tool for predicting 
a quantitative response. Linear regression has been around for a 
long time and is the topic of innumerable textbooks. Though it may seem 
somewhat dull compared to some of the more modern statistical learning 
approaches described in later chapters of this book, linear regression is still 
a useful and widely used statistical learning method. Moreover, it serves 
as a good jumping-off point for newer approaches: as we will see in later 
chapters, many fancy statistical learning approaches can be seen as generalizations 
or extensions of linear regression. Consequently, the importance 
of having a good understanding of linear regression before studying more 
complex learning methods cannot be overstated. In this chapter, we review 
some of the key ideas underlying the linear regression model, as well as the 
least squares approach that is most commonly used to fit this model. 

Recall the Advertising data from Chapter 2. Figure 2.1 displays sales 
(in thousands of units) for a particular product as a function of advertising 
budgets (in thousands of dollars) for TV, radio, and newspaper media. 
Suppose that in our role as statistical consultants we are asked to suggest, 
on the basis of this data, a marketing plan for next year that will result in 
high product sales. What information would be useful in order to provide 
such a recommendation? Here are a few important questions that we might 
seek to address: 

1. Is there a relationship between advertising budget and sales? 

Our first goal should be to determine whether the data provide 
evidence of an association between advertising expenditure and sales. 
If the evidence is weak, then one might argue that no money should 
be spent on advertising! 

2. How strong is the relationship between advertising budget and sales? 
Assuming that there is a relationship between advertising and sales, 
we would like to know the strength of this relationship. In other 
words, given a certain advertising budget, can we predict sales with 
a high level of accuracy? This would be a strong relationship. Or is 
a prediction of sales based on advertising expenditure only slightly 
better than a random guess? This would be a weak relationship. 

3. Which media contribute to sales? 

Do all three media—TV, radio, and newspaper—contribute to sales, 
or do just one or two of the media contribute? To answer this question, 
we must find a way to separate out the individual effects of each 
medium when we have spent money on all three media. 

4. How accurately can we estimate the effect of each medium on sales? 
For every dollar spent on advertising in a particular medium, by 
what amount will sales increase? How accurately can we predict this 
amount of increase? 

5. How accurately can we predict future sales? 

For any given level of television, radio, or newspaper advertising, what 
is our prediction for sales, and what is the accuracy of this prediction? 

6. Is the relationship linear? 

If there is approximately a straight-line relationship between 
advertising expenditure in the various media and sales, then linear regression 
is an appropriate tool. If not, then it may still be possible to 
transform the predictor or the response so that linear regression can be 
used. 

7. Is there synergy among the advertising media? 

Perhaps spending $50,000 on television advertising and $50,000 on 
radio advertising results in more sales than allocating $100,000 to 
either television or radio individually. In marketing, this is known as 
a synergy effect, while in statistics it is called an interaction effect. 


It turns out that linear regression can be used to answer each of these 
questions. We will first discuss all of these questions in a general context, 
and then return to them in this specific context in Section 3.4. 




3.1 Simple Linear Regression 

Simple linear regression lives up to its name: it is a very straightforward 
approach for predicting a quantitative response Y on the basis of a 
single predictor variable X. It assumes that there is approximately a linear 
relationship between X and Y. Mathematically, we can write this linear 
relationship as 

\[Y \approx \beta_0 + \beta_1 X. \tag{3.1}\]

You might read “\(\approx\)” as “is approximately modeled as”. We will sometimes 
describe (3.1) by saying that we are regressing Y on X (or Y onto X). 
For example, X may represent TV advertising and Y may represent sales. 
Then we can regress sales onto TV by fitting the model 

\[\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}.\]

In Equation 3.1, \(\beta_0\) and \(\beta_1\) are two unknown constants that represent 
the intercept and slope terms in the linear model. Together, \(\beta_0\) and \(\beta_1\) are 
known as the model coefficients or parameters. Once we have used our 
training data to produce estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients, we 
can predict future sales on the basis of a particular value of TV advertising 
by computing 

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \tag{3.2}\]

where \(\hat{y}\) indicates a prediction of Y on the basis of X = x. Here we use a 
hat symbol, \(\hat{\ }\), to denote the estimated value for an unknown parameter 
or coefficient, or to denote the predicted value of the response. 

3.1.1 Estimating the Coefficients 

In practice, \(\beta_0\) and \(\beta_1\) are unknown. So before we can use (3.1) to make 
predictions, we must use data to estimate the coefficients. Let 

\[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\]

represent n observation pairs, each of which consists of a measurement 
of X and a measurement of Y. In the Advertising example, this data 
set consists of the TV advertising budget and product sales in n = 200 
different markets. (Recall that the data are displayed in Figure 2.1.) Our 
goal is to obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that the linear model 
(3.1) fits the available data well, that is, so that \(y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i\) for 
\(i = 1, \ldots, n\). In other words, we want to find an intercept \(\hat{\beta}_0\) and a slope \(\hat{\beta}_1\) such 
that the resulting line is as close as possible to the n = 200 data points. 
There are a number of ways of measuring closeness. However, by far the 
most common approach involves minimizing the least squares criterion, 
and we take that approach in this chapter. Alternative approaches will be 
considered in Chapter 6. 


FIGURE 3.1. For the Advertising data, the least squares fit for the regression 
of sales onto TV is shown. The fit is found by minimizing the sum of squared 
errors. Each grey line segment represents an error, and the fit makes a 
compromise by averaging their squares. In this case a linear fit captures the essence of 
the relationship, although it is somewhat deficient in the left of the plot. 


Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction for Y based on the ith value of X. 
Then \(e_i = y_i - \hat{y}_i\) represents the ith residual: this is the difference between 
the ith observed response value and the ith response value that is predicted 
by our linear model. We define the residual sum of squares (RSS) as 

\[\text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2,\]

or equivalently as 

\[\text{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2. \tag{3.3}\]

The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS. Using 
some calculus, one can show that the minimizers are 

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \tag{3.4}\]

where \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i\) and \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i\) are the sample means. In other 
words, (3.4) defines the least squares coefficient estimates for simple linear 
regression. 
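
The closed-form estimates in (3.4) translate almost line by line into code. The following is a minimal sketch in plain Python, not the R implementation referred to elsewhere in this chapter; the function names `least_squares_fit` and `rss` and the five-point toy data set are illustrative assumptions, not part of the Advertising data.

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) using the closed-form estimates (3.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate in (3.4).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(x, y, beta0, beta1):
    """Residual sum of squares (3.3) for a candidate intercept and slope."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_fit(x, y)
print("intercept:", b0, "slope:", b1)
print("RSS at the estimates:", rss(x, y, b0, b1))
# Moving either coefficient away from its estimate can only increase the RSS.
print("RSS with perturbed slope:", rss(x, y, b0, b1 + 0.1))
```

Comparing the last two printed values gives a quick numerical check that the estimates minimize the RSS on this toy data set.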

Figure 3.1 displays the simple linear regression fit to the Advertising 
data, where \(\hat{\beta}_0 = 7.03\) and \(\hat{\beta}_1 = 0.0475\). In other words, according to 
this approximation, an additional $1,000 spent on TV advertising is 
associated with selling approximately 47.5 additional units of the product. In 
Figure 3.2, we have computed RSS for a number of values of \(\beta_0\) and \(\beta_1\), 
using the advertising data with sales as the response and TV as the 
predictor. In each plot, the red dot represents the pair of least squares estimates 
\((\hat{\beta}_0, \hat{\beta}_1)\) given by (3.4). These values clearly minimize the RSS. 

FIGURE 3.2. Contour and three-dimensional plots of the RSS on the 
Advertising data, using sales as the response and TV as the predictor. The 
red dots correspond to the least squares estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), given by (3.4). 
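
The interpretation of the fitted slope can be checked with a few lines of Python. This sketch plugs the reported estimates (7.03 and 0.0475) into the prediction formula (3.2); the function name `predict_sales` and the TV budget of 100 (that is, $100,000) are illustrative assumptions.

```python
# Reported least squares estimates for the regression of sales onto TV.
BETA0_HAT = 7.03    # intercept, in thousands of units
BETA1_HAT = 0.0475  # slope, in thousands of units per thousand dollars of TV budget


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget in thousands of dollars."""
    return BETA0_HAT + BETA1_HAT * tv_budget


# Hypothetical budget of $100,000, then one additional $1,000.
base = predict_sales(100.0)
extra_units = (predict_sales(101.0) - base) * 1000  # convert thousands of units to units

print("predicted sales at TV = 100:", base)
print("additional units per extra $1,000:", extra_units)
```

The difference between the two predictions is the slope expressed in units, which matches the figure of roughly 47.5 additional units quoted above.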

3.1.2 Assessing the Accuracy of the Coefficient Estimates 

Recall from (2.1) that we assume that the true relationship between X and 
Y takes the form \(Y = f(X) + \epsilon\) for some unknown function f, where \(\epsilon\) 
is a mean-zero random error term. If f is to be approximated by a linear 
function, then we can write this relationship as 

\[Y = \beta_0 + \beta_1 X + \epsilon. \tag{3.5}\]

Here \(\beta_0\) is the intercept term, that is, the expected value of Y when X = 0, 
and \(\beta_1\) is the slope, the average increase in Y associated with a one-unit 
increase in X. The error term is a catch-all for what we miss with this 
simple model: the true relationship is probably not linear, there may be 
other variables that cause variation in Y, and there may be measurement 
error. We typically assume that the error term is independent of X. 

The model given by (3.5) defines the population regression line, which 
is the best linear approximation to the true relationship between X and 
Y.¹ The least squares regression coefficient estimates (3.4) characterize the 
least squares line (3.2). The left-hand panel of Figure 3.3 displays these 
two lines in a simple simulated example. We created 100 random Xs, and 
generated 100 corresponding Ys from the model 

\[Y = 2 + 3X + \epsilon, \tag{3.6}\]

where \(\epsilon\) was generated from a normal distribution with mean zero. The 
red line in the left-hand panel of Figure 3.3 displays the true relationship, 
f(X) = 2 + 3X, while the blue line is the least squares estimate based 
on the observed data. The true relationship is generally not known for 
real data, but the least squares line can always be computed using the 
coefficient estimates given in (3.4). In other words, in real applications, 
we have access to a set of observations from which we can compute the 
least squares line; however, the population regression line is unobserved. 
In the right-hand panel of Figure 3.3 we have generated ten different data 
sets from the model given by (3.6) and plotted the corresponding ten least 
squares lines. Notice that different data sets generated from the same true 
model result in slightly different least squares lines, but the unobserved 
population regression line does not change. 

1 The assumption of linearity is often a useful working model. However, despite what 
many textbooks might tell us, we seldom believe that the true relationship is linear. 

FIGURE 3.3. A simulated data set. Left: The red line represents the true 
relationship, f(X) = 2 + 3X, which is known as the population regression line. The 
blue line is the least squares line; it is the least squares estimate for f(X) based 
on the observed data, shown in black. Right: The population regression line is 
again shown in red, and the least squares line in dark blue. In light blue, ten least 
squares lines are shown, each computed on the basis of a separate random set of 
observations. Each least squares line is different, but on average, the least squares 
lines are quite close to the population regression line. 

At first glance, the difference between the population regression line and 
the least squares line may seem subtle and confusing. We only have one 
data set, and so what does it mean that two different lines describe the 
relationship between the predictor and the response? Fundamentally, the 
concept of these two lines is a natural extension of the standard statistical 
approach of using information from a sample to estimate characteristics of a 
large population. For example, suppose that we are interested in knowing 
the population mean \(\mu\) of some random variable Y. Unfortunately, \(\mu\) is 
unknown, but we do have access to n observations from Y, which we can 
write as \(y_1, \ldots, y_n\), and which we can use to estimate \(\mu\). A reasonable 
estimate is \(\hat{\mu} = \bar{y}\), where \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\) is the sample mean. The sample 
mean and the population mean are different, but in general the sample 
mean will provide a good estimate of the population mean. In the same 
way, the unknown coefficients \(\beta_0\) and \(\beta_1\) in linear regression define the 
population regression line. We seek to estimate these unknown coefficients 
using \(\hat{\beta}_0\) and \(\hat{\beta}_1\) given in (3.4). These coefficient estimates define the least 
squares line. 

The analogy between linear regression and estimation of the mean of a 
random variable is an apt one based on the concept of bias. If we use the 
sample mean \(\hat{\mu}\) to estimate \(\mu\), this estimate is unbiased, in the sense that 
on average, we expect \(\hat{\mu}\) to equal \(\mu\). What exactly does this mean? It means 
that on the basis of one particular set of observations \(y_1, \ldots, y_n\), \(\hat{\mu}\) might 
overestimate \(\mu\), and on the basis of another set of observations, \(\hat{\mu}\) might 
underestimate \(\mu\). But if we could average a huge number of estimates of 
\(\mu\) obtained from a huge number of sets of observations, then this average 
would exactly equal \(\mu\). Hence, an unbiased estimator does not systematically 
over- or under-estimate the true parameter. The property of unbiasedness 
holds for the least squares coefficient estimates given by (3.4) as well: if 
we estimate \(\beta_0\) and \(\beta_1\) on the basis of a particular data set, then our 
estimates won’t be exactly equal to \(\beta_0\) and \(\beta_1\). But if we could average 
the estimates obtained over a huge number of data sets, then the average 
of these estimates would be spot on! In fact, we can see from the 
right-hand panel of Figure 3.3 that the average of many least squares lines, each 
estimated from a separate data set, is pretty close to the true population 
regression line. 
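
Unbiasedness can be illustrated by repeating the simulation from model (3.6) many times and averaging the resulting estimates. The sketch below is one way to do this in plain Python. The text does not specify how the Xs were drawn or the standard deviation of the error term, so the uniform range for X, the value `sigma=2.0`, the seed, and the number of repetitions are all assumptions made for illustration.

```python
import random


def least_squares_fit(x, y):
    """Closed-form simple linear regression estimates, as in (3.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat


def simulate_fit(rng, n=100, sigma=2.0):
    """Draw one data set from Y = 2 + 3X + eps and return its least squares fit."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]            # assumed design for X
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, sigma) for xi in x]  # assumed error sd
    return least_squares_fit(x, y)


rng = random.Random(0)  # fixed seed so the run is reproducible
fits = [simulate_fit(rng) for _ in range(1000)]

mean_b0 = sum(b0 for b0, _ in fits) / len(fits)
mean_b1 = sum(b1 for _, b1 in fits) / len(fits)
slopes = [b1 for _, b1 in fits]

print("range of individual slope estimates:", min(slopes), max(slopes))
print("average intercept and slope:", mean_b0, mean_b1)
```

Individual fits scatter around the population line, while the averages should land close to the true values 2 and 3, which is the behaviour described above.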

We continue the analogy with the estimation of the population mean 
\(\mu\) of a random variable Y. A natural question is as follows: how accurate 
is the sample mean \(\hat{\mu}\) as an estimate of \(\mu\)? We have established that the 
average of \(\hat{\mu}\)'s over many data sets will be very close to \(\mu\), but that a 
single estimate \(\hat{\mu}\) may be a substantial underestimate or overestimate of \(\mu\). 
How far off will that single estimate of \(\hat{\mu}\) be? In general, we answer this 
question by computing the standard error of \(\hat{\mu}\), written as \(\text{SE}(\hat{\mu})\). We have 
the well-known formula 

\[\text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}, \tag{3.7}\]

where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of Y.² 
Roughly speaking, the standard error tells us the average amount that this 
estimate \(\hat{\mu}\) differs from the actual value of \(\mu\). Equation 3.7 also tells us how 
this deviation shrinks with n: the more observations we have, the smaller 
the standard error of \(\hat{\mu}\). In a similar vein, we can wonder how close \(\hat{\beta}_0\) 
and \(\hat{\beta}_1\) are to the true values \(\beta_0\) and \(\beta_1\). To compute the standard errors 
associated with \(\hat{\beta}_0\) and \(\hat{\beta}_1\), we use the following formulas: 


\[\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \tag{3.8}\]

where \(\sigma^2 = \text{Var}(\epsilon)\). For these formulas to be strictly valid, we need to 
assume that the errors for each observation are uncorrelated with common 
variance \(\sigma^2\). This is clearly not true in Figure 3.1, but the formula still 
turns out to be a good approximation. Notice in the formula that \(\text{SE}(\hat{\beta}_1)\) is 
smaller when the \(x_i\) are more spread out; intuitively we have more leverage 
to estimate a slope when this is the case. We also see that \(\text{SE}(\hat{\beta}_0)\) would be 
the same as \(\text{SE}(\hat{\mu})\) if \(\bar{x}\) were zero (in which case \(\hat{\beta}_0\) would be equal to \(\bar{y}\)). 
In general, \(\sigma^2\) is not known, but can be estimated from the data. The 
estimate of \(\sigma\) is known as the residual standard error, and is given by the formula 
\(\text{RSE} = \sqrt{\text{RSS}/(n - 2)}\). Strictly speaking, when \(\sigma^2\) is estimated from the 
data we should write \(\widehat{\text{SE}}(\hat{\beta}_1)\) to indicate that an estimate has been made, 
but for simplicity of notation we will drop this extra “hat”. 
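
The formulas in (3.8), with \(\sigma\) replaced by the residual standard error, can be computed directly from the data. The following plain-Python sketch shows one way to do this; the function name `regression_standard_errors`, the dictionary keys, and the toy data are illustrative assumptions.

```python
import math


def regression_standard_errors(x, y):
    """Fit simple linear regression and return RSE, SE(beta0_hat), SE(beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual sum of squares and the residual standard error sqrt(RSS / (n - 2)).
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))

    # Equation (3.8) with the unknown sigma replaced by its estimate, the RSE.
    se_b0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_b1 = rse / math.sqrt(sxx)
    return {"b0": b0, "b1": b1, "rss": rss, "rse": rse, "se_b0": se_b0, "se_b1": se_b1}


# Hypothetical toy data (not the Advertising data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = regression_standard_errors(x, y)
for name, value in result.items():
    print(name, value)
```

Spreading the x values further apart increases the sum of squared deviations of x and therefore shrinks the standard error of the slope, as noted above.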

Standard errors can be used to compute confidence intervals. A 95% 
confidence interval is defined as a range of values such that with 95% 
probability, the range will contain the true unknown value of the parameter. 
The range is defined in terms of lower and upper limits computed from the 
sample of data. For linear regression, the 95% confidence interval for \(\beta_1\) 
approximately takes the form 

\[\hat{\beta}_1 \pm 2 \cdot \text{SE}(\hat{\beta}_1). \tag{3.9}\]

That is, there is approximately a 95% chance that the interval 

\[\left[ \hat{\beta}_1 - 2 \cdot \text{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot \text{SE}(\hat{\beta}_1) \right] \tag{3.10}\]

will contain the true value of \(\beta_1\).³ Similarly, a confidence interval for \(\beta_0\) 
approximately takes the form 

\[\hat{\beta}_0 \pm 2 \cdot \text{SE}(\hat{\beta}_0). \tag{3.11}\]

2 This formula holds provided that the n observations are uncorrelated. 

3 Approximately for several reasons. Equation 3.10 relies on the assumption that the 
errors are Gaussian. Also, the factor of 2 in front of the \(\text{SE}(\hat{\beta}_1)\) term will vary slightly 
depending on the number of observations n in the linear regression. To be precise, rather 
than the number 2, (3.10) should contain the 97.5% quantile of a t-distribution with 
n − 2 degrees of freedom. Details of how to compute the 95% confidence interval precisely 
in R will be provided later in this chapter. 
In the case of the advertising data, the 95% confidence interval for \(\beta_0\) 
is [6.130, 7.935] and the 95% confidence interval for \(\beta_1\) is [0.042, 0.053]. 
Therefore, we can conclude that in the absence of any advertising, sales will, 
on average, fall somewhere between 6,130 and 7,940 units. Furthermore, 
for each $1,000 increase in television advertising, there will be an average 
increase in sales of between 42 and 53 units. 
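
The approximate intervals (3.9) and (3.11) are simple to compute once an estimate and its standard error are available. The sketch below uses the coefficient estimates and standard errors reported for the regression of sales onto TV (7.0325 with standard error 0.4578, and 0.0475 with standard error 0.0027); the function name `approx_95_ci` is an illustrative assumption.

```python
def approx_95_ci(estimate, std_error):
    """Approximate 95% confidence interval: estimate plus or minus 2 standard errors."""
    return (estimate - 2.0 * std_error, estimate + 2.0 * std_error)


# Reported estimates and standard errors for the regression of sales onto TV.
ci_intercept = approx_95_ci(7.0325, 0.4578)
ci_tv = approx_95_ci(0.0475, 0.0027)

print("approximate 95% CI for the intercept:", ci_intercept)
print("approximate 95% CI for the TV slope:", ci_tv)
```

The interval for the intercept comes out slightly wider than the [6.130, 7.935] quoted above, because the factor of 2 only approximates the 97.5% quantile of the t-distribution with n − 2 degrees of freedom, which is a little smaller than 2 when n = 200.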

Standard errors can also be used to perform hypothesis tests on the 
coefficients. The most common hypothesis test involves testing the null 
hypothesis of 

\[H_0: \text{There is no relationship between } X \text{ and } Y \tag{3.12}\]

versus the alternative hypothesis 

\[H_a: \text{There is some relationship between } X \text{ and } Y. \tag{3.13}\]

Mathematically, this corresponds to testing 

\[H_0: \beta_1 = 0\]

versus 

\[H_a: \beta_1 \neq 0,\]

since if \(\beta_1 = 0\) then the model (3.5) reduces to \(Y = \beta_0 + \epsilon\), and X is 
not associated with Y. To test the null hypothesis, we need to determine 
whether \(\hat{\beta}_1\), our estimate for \(\beta_1\), is sufficiently far from zero that we can 
be confident that \(\beta_1\) is non-zero. How far is far enough? This of course 
depends on the accuracy of \(\hat{\beta}_1\), that is, it depends on \(\text{SE}(\hat{\beta}_1)\). If \(\text{SE}(\hat{\beta}_1)\) is 
small, then even relatively small values of \(\hat{\beta}_1\) may provide strong evidence 
that \(\beta_1 \neq 0\), and hence that there is a relationship between X and Y. In 
contrast, if \(\text{SE}(\hat{\beta}_1)\) is large, then \(\hat{\beta}_1\) must be large in absolute value in order 
for us to reject the null hypothesis. In practice, we compute a t-statistic, 
given by 

\[t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}, \tag{3.14}\]

which measures the number of standard deviations that \(\hat{\beta}_1\) is away from 
0. If there really is no relationship between X and Y, then we expect 
that (3.14) will have a t-distribution with n − 2 degrees of freedom. The 
t-distribution has a bell shape and for values of n greater than approximately 
30 it is quite similar to the normal distribution. Consequently, it is a simple 
matter to compute the probability of observing any value equal to |t| or 
larger in absolute value, assuming \(\beta_1 = 0\). We call this probability the p-value. Roughly 
speaking, we interpret the p-value as follows: a small p-value indicates that 
it is unlikely to observe such a substantial association between the predictor 
and the response due to chance, in the absence of any real association 
between the predictor and the response. Hence, if we see a small p-value, 
then we can infer that there is an association between the predictor and the 
response. We reject the null hypothesis (that is, we declare a relationship 
to exist between X and Y) if the p-value is small enough. Typical p-value 
cutoffs for rejecting the null hypothesis are 5% or 1%. When n = 30, these 
correspond to t-statistics (3.14) of around 2 and 2.75, respectively. 



            Coefficient   Std. error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001


TABLE 3.1. For the Advertising data, coefficients of the least squares model 
for the regression of number of units sold on TV advertising budget. An increase 
of $1,000 in the TV advertising budget is associated with an increase in sales by 
around 50 units (Recall that the sales variable is in thousands of units, and the 
TV variable is in thousands of dollars). 

Table 3.1 provides details of the least squares model for the regression of 
number of units sold on TV advertising budget for the Advertising data. 
Notice that the coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are very large relative to their 
standard errors, so the t-statistics are also large; the probabilities of seeing 
such values if \(H_0\) is true are virtually zero. Hence we can conclude that 
\(\beta_0 \neq 0\) and \(\beta_1 \neq 0\).⁴ 
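
The t-statistics in Table 3.1 can be reproduced, up to rounding, from the tabulated coefficients and standard errors using (3.14). The sketch below also computes a two-sided p-value using the normal approximation to the t-distribution, which the text notes is reasonable for n greater than about 30. Using the normal distribution in place of the exact t-distribution is an assumption made here to stay within the Python standard library; the function names are illustrative.

```python
import math


def t_statistic(estimate, std_error):
    """t-statistic (3.14) for testing that the true coefficient is zero."""
    return (estimate - 0.0) / std_error


def two_sided_p_value_normal(t):
    """P(|Z| >= |t|) for a standard normal Z, a large-n stand-in for the t-distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Rounded values from Table 3.1.
t_intercept = t_statistic(7.0325, 0.4578)
t_tv = t_statistic(0.0475, 0.0027)

print("t-statistic for the intercept:", t_intercept)
print("t-statistic for TV:", t_tv)
print("approximate p-value for TV:", two_sided_p_value_normal(t_tv))
print("approximate p-value at |t| = 1.96:", two_sided_p_value_normal(1.96))
```

The TV t-statistic computed this way differs slightly from the tabulated 17.67 because the table entries are rounded before division. Both p-values are far below 0.0001, in agreement with the table.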

3.1.3 Assessing the Accuracy of the Model 

Once we have rejected the null hypothesis (3.12) in favor of the alternative 
hypothesis (3.13), it is natural to want to quantify the extent to which the 
model fits the data. The quality of a linear regression fit is typically assessed 
using two related quantities: the residual standard error (RSE) and the \(R^2\) 
statistic. 

Table 3.2 displays the RSE, the \(R^2\) statistic, and the F-statistic (to be 
described in Section 3.2.2) for the linear regression of number of units sold 
on TV advertising budget. 

Residual Standard Error 

Recall from the model (3.5) that associated with each observation is an 
error term \(\epsilon\). Due to the presence of these error terms, even if we knew the 
true regression line (i.e. even if \(\beta_0\) and \(\beta_1\) were known), we would not be 
able to perfectly predict Y from X. The RSE is an estimate of the standard 
deviation of \(\epsilon\). Roughly speaking, it is the average amount that the response 
will deviate from the true regression line. It is computed using the formula 

\[\text{RSE} = \sqrt{\frac{1}{n-2} \text{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \tag{3.15}\]

Note that RSS was defined in Section 3.1.1, and is given by the formula 

\[\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{3.16}\]

4 In Table 3.1, a small p-value for the intercept indicates that we can reject the null 
hypothesis that \(\beta_0 = 0\), and a small p-value for TV indicates that we can reject the null 
hypothesis that \(\beta_1 = 0\). Rejecting the latter null hypothesis allows us to conclude that 
there is a relationship between TV and sales. Rejecting the former allows us to conclude 
that in the absence of TV expenditure, sales are non-zero. 

Quantity                  Value
Residual standard error   3.26
\(R^2\)                   0.612
F-statistic               312.1

TABLE 3.2. For the Advertising data, more information about the least squares 
model for the regression of number of units sold on TV advertising budget. 


In the case of the advertising data, we see from the linear regression 
output in Table 3.2 that the RSE is 3.26. In other words, actual sales in 
each market deviate from the true regression line by approximately 3,260 
units, on average. Another way to think about this is that even if the 
model were correct and the true values of the unknown coefficients \(\beta_0\) 
and \(\beta_1\) were known exactly, any prediction of sales on the basis of TV 
advertising would still be off by about 3,260 units on average. Of course, 
whether or not 3,260 units is an acceptable prediction error depends on the 
problem context. In the advertising data set, the mean value of sales over 
all markets is approximately 14,000 units, and so the percentage error is 
3,260/14,000 = 23%. 

The RSE is considered a measure of the lack of fit of the model (3.5) to 
the data. If the predictions obtained using the model are very close to the 
true outcome values, that is, if \(\hat{y}_i \approx y_i\) for \(i = 1, \ldots, n\), then (3.15) will 
be small, and we can conclude that the model fits the data very well. On 
the other hand, if \(\hat{y}_i\) is very far from \(y_i\) for one or more observations, then 
the RSE may be quite large, indicating that the model doesn’t fit the data 
well. 

\(R^2\) Statistic 

The RSE provides an absolute measure of lack of fit of the model (3.5) 
to the data. But since it is measured in the units of Y, it is not always 
clear what constitutes a good RSE. The \(R^2\) statistic provides an alternative 
measure of fit. It takes the form of a proportion (the proportion of variance 
explained) and so it always takes on a value between 0 and 1, and is 
independent of the scale of Y. 
To calculate \(R^2\), we use the formula 

\[R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}, \tag{3.17}\]

where \(\text{TSS} = \sum (y_i - \bar{y})^2\) is the total sum of squares, and RSS is defined 
in (3.16). TSS measures the total variance in the response Y, and can be 
thought of as the amount of variability inherent in the response before the 
regression is performed. In contrast, RSS measures the amount of variability 
that is left unexplained after performing the regression. Hence, TSS − RSS 
measures the amount of variability in the response that is explained (or 
removed) by performing the regression, and \(R^2\) measures the proportion 
of variability in Y that can be explained using X. An \(R^2\) statistic that is 
close to 1 indicates that a large proportion of the variability in the response 
has been explained by the regression. A number near 0 indicates that the 
regression did not explain much of the variability in the response; this might 
occur because the linear model is wrong, or the inherent error \(\sigma^2\) is high, 
or both. In Table 3.2, the \(R^2\) was 0.61, and so just under two-thirds of the 
variability in sales is explained by a linear regression on TV. 
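
Formula (3.17) can be evaluated from the observed responses and the fitted values alone. The following sketch computes TSS, RSS, and the resulting statistic in plain Python; the function names and the toy data are illustrative assumptions, and the fitted values come from the closed-form estimates (3.4).

```python
def fitted_values(x, y):
    """Fitted values from the simple linear regression of y onto x, using (3.4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return [b0 + b1 * xi for xi in x]


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS, as in (3.17)."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual sum of squares
    return 1.0 - rss / tss


# Hypothetical toy data (not the Advertising data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(y, fitted_values(x, y))
print("R^2 of the least squares fit:", r2)

# Two reference points: perfect predictions, and always predicting the mean.
y_bar = sum(y) / len(y)
print("R^2 of perfect predictions:", r_squared(y, y))
print("R^2 of predicting the mean:", r_squared(y, [y_bar] * len(y)))
```

The two reference cases mark the ends of the scale: predictions that reproduce the data exactly leave no unexplained variability, while always predicting the sample mean explains none of it.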

The \(R^2\) statistic (3.17) has an interpretational advantage over the RSE 
(3.15), since unlike the RSE, it always lies between 0 and 1. However, it can 
still be challenging to determine what is a good \(R^2\) value, and in general, 
this will depend on the application. For instance, in certain problems in 
physics, we may know that the data truly comes from a linear model with 
a small residual error. In this case, we would expect to see an \(R^2\) value that 
is extremely close to 1, and a substantially smaller \(R^2\) value might indicate a 
serious problem with the experiment in which the data were generated. On 
the other hand, in typical applications in biology, psychology, marketing, 
and other domains, the linear model (3.5) is at best an extremely rough 
approximation to the data, and residual errors due to other unmeasured 
factors are often very large. In this setting, we would expect only a very 
small proportion of the variance in the response to be explained by the 
predictor, and an \(R^2\) value well below 0.1 might be more realistic! 

The \(R^2\) statistic is a measure of the linear relationship between X and 
Y. Recall that correlation, defined as 

\[\text{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \tag{3.18}\]

is also a measure of the linear relationship between X and Y.⁵ This 
suggests that we might be able to use r = Cor(X, Y) instead of \(R^2\) in order to 
assess the fit of the linear model. In fact, it can be shown that in the simple 
linear regression setting, \(R^2 = r^2\). In other words, the squared correlation 
and the \(R^2\) statistic are identical. However, in the next section we will 
discuss the multiple linear regression problem, in which we use several 
predictors simultaneously to predict the response. The concept of correlation 
between the predictors and the response does not extend automatically to 
this setting, since correlation quantifies the association between a single 
pair of variables rather than between a larger number of variables. We will 
see that \(R^2\) fills this role. 

5 We note that in fact, the right-hand side of (3.18) is the sample correlation; thus, 
it would be more correct to write \(\widehat{\text{Cor}}(X, Y)\); however, we omit the “hat” for ease of 
notation. 
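
The identity between the squared correlation and the \(R^2\) statistic in simple linear regression can be verified numerically. The sketch below computes the sample correlation (3.18) and the \(R^2\) statistic (3.17) separately and compares them; the function names and the six-point toy data set are illustrative assumptions.

```python
import math


def correlation(x, y):
    """Sample correlation between x and y, as in (3.18)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


def r_squared_simple(x, y):
    """R^2 (3.17) for the least squares regression of y onto the single predictor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss


# Hypothetical noisy toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.7, 4.8, 4.1, 6.3]

r = correlation(x, y)
r2 = r_squared_simple(x, y)
print("squared correlation:", r ** 2)
print("R^2 statistic:      ", r2)
```

The two printed values agree up to floating-point rounding, which is what the identity predicts for a single predictor.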

3.2 Multiple Linear Regression 

Simple linear regression is a useful approach for predicting a response on the 
basis of a single predictor variable. However, in practice we often have more 
than one predictor. For example, in the Advertising data, we have examined 
the relationship between sales and TV advertising. We also have data for 
the amount of money spent advertising on the radio and in newspapers, 
and we may want to know whether either of these two media is associated 
with sales. How can we extend our analysis of the advertising data in order 
to accommodate these two additional predictors? 

One option is to run three separate simple linear regressions, each of 
which uses a different advertising medium as a predictor. For instance, 
we can fit a simple linear regression to predict sales on the basis of the 
amount spent on radio advertisements. Results are shown in Table 3.3 (top 
table). We find that a $1,000 increase in spending on radio advertising is 
associated with an increase in sales by around 203 units. Table 3.3 (bottom 
table) contains the least squares coefficients for a simple linear regression of 
sales onto newspaper advertising budget. A $1,000 increase in newspaper 
advertising budget is associated with an increase in sales by approximately 
55 units. 

However, the approach of fitting a separate simple linear regression model 
for each predictor is not entirely satisfactory. First of all, it is unclear how to 
make a single prediction of sales given levels of the three advertising media 
budgets, since each of the budgets is associated with a separate regression 
equation. Second, each of the three regression equations ignores the other 
two media in forming estimates for the regression coefficients. We will see 
shortly that if the media budgets are correlated with each other in the 200 
markets that constitute our data set, then this can lead to very misleading 
estimates of the individual media effects on sales. 

Instead of fitting a separate simple linear regression model for each 
predictor, a better approach is to extend the simple linear regression model 
(3.5) so that it can directly accommodate multiple predictors. We can do 
this by giving each predictor a separate slope coefficient in a single model. 
In general, suppose that we have p distinct predictors. Then the multiple 
linear regression model takes the form 

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon, \tag{3.19}\]

where \(X_j\) represents the jth predictor and \(\beta_j\) quantifies the association 
between that variable and the response. We interpret \(\beta_j\) as the average 
effect on Y of a one unit increase in \(X_j\), holding all other predictors fixed. 
In the advertising example, (3.19) becomes 

\[\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon. \tag{3.20}\]

Simple regression of sales on radio 

            Coefficient   Std. error   t-statistic   p-value
Intercept   9.312         0.563        16.54         < 0.0001
radio       0.203         0.020        9.92          < 0.0001

Simple regression of sales on newspaper 

            Coefficient   Std. error   t-statistic   p-value
Intercept   12.351        0.621        19.88         < 0.0001
newspaper   0.055         0.017        3.30          0.00115

TABLE 3.3. More simple linear regression models for the Advertising data. 
Coefficients of the simple linear regression model for number of units sold on Top: 
radio advertising budget and Bottom: newspaper advertising budget. A $1,000 
increase in spending on radio advertising is associated with an average increase in 
sales by around 203 units, while the same increase in spending on newspaper 
advertising is associated with an average increase in sales by around 55 units (Note 
that the sales variable is in thousands of units, and the radio and newspaper 
variables are in thousands of dollars). 


3.2.1 Estimating the Regression Coefficients 

As was the case in the simple linear regression setting, the regression 
coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) in (3.19) are unknown, and must be estimated. Given 
estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\), we can make predictions using the formula 

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p. \tag{3.21}\]

The parameters are estimated using the same least squares approach that 
we saw in the context of simple linear regression. We choose \(\beta_0, \beta_1, \ldots, \beta_p\) 
to minimize the sum of squared residuals 

\[\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip})^2. \tag{3.22}\]
FIGURE 3.4. In a three-dimensional setting, with two predictors and one 
response, the least squares regression line becomes a plane. The plane is chosen 
to minimize the sum of the squared vertical distances between each observation 
(shown in red) and the plane. 


The values \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p\) that minimize (3.22) are the multiple least squares
regression coefficient estimates. Unlike the simple linear regression
estimates given in (3.4), the multiple regression coefficient estimates have
somewhat complicated forms that are most easily represented using matrix
algebra. For this reason, we do not provide them here. Any statistical
software package can be used to compute these coefficient estimates, and
later in this chapter we will show how this can be done in R. Figure 3.4
illustrates an example of the least squares fit to a toy data set with \(p = 2\)
predictors.
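For readers who want to see what the software does, the following is a minimal sketch in Python rather than R, using only the standard library. It assumes the standard matrix form of the problem, in which the minimizer of (3.22) solves the normal equations \(X^T X \hat\beta = X^T y\) with a column of ones prepended to the predictors. The function names and the toy data are hypothetical and chosen for illustration; a real analysis would rely on a statistical package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(xs, ys):
    """Return [b0, b1, ..., bp] minimizing the RSS in (3.22)."""
    design = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data with p = 2, generated exactly from y = 1 + 2*x1 + 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
coefs = least_squares(xs, ys)  # intercept, then one coefficient per predictor
```

Because the toy responses lie exactly on a plane, the fitted coefficients recover the generating values up to floating-point error. With noisy data the same code returns the plane that minimizes the sum of squared vertical distances.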

Table 3.4 displays the multiple regression coefficient estimates when TV,
radio, and newspaper advertising budgets are used to predict product sales
using the Advertising data. We interpret these results as follows: for a given
amount of TV and newspaper advertising, spending an additional $1,000
on radio advertising is associated with an increase in sales of approximately
189 units. Comparing these coefficient estimates to those displayed in Tables 3.1
and 3.3, we notice that the multiple regression coefficient estimates for
TV and radio are pretty similar to the simple linear regression coefficient
estimates. However, while the newspaper regression coefficient estimate in
Table 3.3 was significantly non-zero, the coefficient estimate for newspaper
in the multiple regression model is close to zero, and the corresponding
p-value is no longer significant, with a value around 0.86. This illustrates
that the simple and multiple regression coefficients can be quite different.
This difference stems from the fact that in the simple regression case, the
slope term represents the average effect of a $1,000 increase in newspaper
advertising, ignoring other predictors such as TV and radio. In contrast, in
the multiple regression setting, the coefficient for newspaper represents the
average effect of increasing newspaper spending by $1,000 while holding TV
and radio fixed.

             Coefficient   Std. error   t-statistic   p-value
Intercept          2.939       0.3119          9.42   < 0.0001
TV                 0.046       0.0014         32.81   < 0.0001
radio              0.189       0.0086         21.89   < 0.0001
newspaper         -0.001       0.0059         -0.18     0.8599

TABLE 3.4. For the Advertising data, least squares coefficient estimates of the
multiple linear regression of number of units sold on radio, TV, and newspaper
advertising budgets.

Does it make sense for the multiple regression to suggest no relationship 
between sales and newspaper while the simple linear regression implies the 
opposite? In fact it does. Consider the correlation matrix for the three 
predictor variables and response variable, displayed in Table 3.5. Notice 
that the correlation between radio and newspaper is 0.35. This reveals a 
tendency to spend more on newspaper advertising in markets where more 
is spent on radio advertising. Now suppose that the multiple regression is 
correct and newspaper advertising has no direct impact on sales, but radio 
advertising does increase sales. Then in markets where we spend more 
on radio our sales will tend to be higher, and as our correlation matrix 
shows, we also tend to spend more on newspaper advertising in those same 
markets. Hence, in a simple linear regression which only examines sales 
versus newspaper, we will observe that higher values of newspaper tend to be 
associated with higher values of sales, even though newspaper advertising 
does not actually affect sales. So newspaper advertising is a surrogate for radio
advertising; newspaper gets “credit” for the effect of radio on sales.
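The surrogate effect can be reproduced with a small simulation. The sketch below is illustrative only and does not use the Advertising data: it generates hypothetical standardized radio and newspaper budgets with a correlation of about 0.35, lets sales depend on radio alone, and then compares the simple regression slope for newspaper with its multiple regression coefficient. The effect size of 2.0, the noise level, and the seed are arbitrary assumptions.

```python
import random

random.seed(0)
n = 5000
radio, newspaper, sales = [], [], []
for _ in range(n):
    r = random.gauss(0, 1)
    # Newspaper spending is correlated with radio spending (about 0.35) ...
    nw = 0.35 * r + (1 - 0.35 ** 2) ** 0.5 * random.gauss(0, 1)
    # ... but sales depend on radio only.
    s = 2.0 * r + random.gauss(0, 1)
    radio.append(r)
    newspaper.append(nw)
    sales.append(s)


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance; cov(u, u) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


# Simple regression of sales on newspaper alone.
simple_slope = cov(newspaper, sales) / cov(newspaper, newspaper)

# Multiple regression of sales on radio and newspaper, using the
# closed-form solution of the two-predictor normal equations.
s_rr, s_nn, s_rn = cov(radio, radio), cov(newspaper, newspaper), cov(radio, newspaper)
s_ry, s_ny = cov(radio, sales), cov(newspaper, sales)
det = s_rr * s_nn - s_rn ** 2
beta_radio = (s_nn * s_ry - s_rn * s_ny) / det
beta_newspaper = (s_rr * s_ny - s_rn * s_ry) / det
```

In this setup the simple slope for newspaper comes out clearly positive (its expected value is roughly 2.0 times 0.35), while the multiple regression coefficient for newspaper is close to zero once radio is held fixed.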

This slightly counterintuitive result is very common in many real life 
situations. Consider an absurd example to illustrate the point. Running 
a regression of shark attacks versus ice cream sales for data collected at 
a given beach community over a period of time would show a positive 
relationship, similar to that seen between sales and newspaper. Of course 
no one (yet) has suggested that ice creams should be banned at beaches 
to reduce shark attacks. In reality, higher temperatures cause more people 
to visit the beach, which in turn results in more ice cream sales and more 
shark attacks. A multiple regression of shark attacks onto ice cream sales and
temperature reveals that, as intuition implies, ice cream sales is no
longer a significant predictor after adjusting for temperature.

                 TV    radio   newspaper    sales
TV           1.0000   0.0548      0.0567   0.7822
radio                 1.0000      0.3541   0.5762
newspaper                         1.0000   0.2283
sales                                      1.0000


TABLE 3.5. Correlation matrix for TV, radio, newspaper, and sales for the 
Advertising data. 

3.2.2 Some Important Questions 

When we perform multiple linear regression, we usually are interested in 
answering a few important questions. 

1. Is at least one of the predictors \(X_1, X_2, \ldots, X_p\) useful in predicting
the response?

2. Do all the predictors help to explain Y, or is only a subset of the 
predictors useful? 

3. How well does the model fit the data? 

4. Given a set of predictor values, what response value should we predict, 
and how accurate is our prediction? 

We now address each of these questions in turn. 

One: Is There a Relationship Between the Response and Predictors? 

Recall that in the simple linear regression setting, in order to determine
whether there is a relationship between the response and the predictor we
can simply check whether \(\beta_1 = 0\). In the multiple regression setting with \(p\)
predictors, we need to ask whether all of the regression coefficients are zero,
i.e. whether \(\beta_1 = \beta_2 = \cdots = \beta_p = 0\). As in the simple linear regression
setting, we use a hypothesis test to answer this question. We test the null
hypothesis,

\[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0 \]

versus the alternative

\[ H_a : \text{at least one } \beta_j \text{ is non-zero.} \]

This hypothesis test is performed by computing the F-statistic,

\[ F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}, \tag{3.23} \]

where, as with simple linear regression, \(\mathrm{TSS} = \sum (y_i - \bar{y})^2\) and
\(\mathrm{RSS} = \sum (y_i - \hat{y}_i)^2\). If the linear model assumptions are correct, one can show that

\[ E\{\mathrm{RSS}/(n - p - 1)\} = \sigma^2 \]

and that, provided \(H_0\) is true,

\[ E\{(\mathrm{TSS} - \mathrm{RSS})/p\} = \sigma^2. \]

Hence, when there is no relationship between the response and predictors,
one would expect the F-statistic to take on a value close to 1. On the other
hand, if \(H_a\) is true, then \(E\{(\mathrm{TSS} - \mathrm{RSS})/p\} > \sigma^2\), so we expect \(F\) to be
greater than 1.

Quantity                   Value
Residual standard error    1.69
R^2                        0.897
F-statistic                570

TABLE 3.6. More information about the least squares model for the regression
of number of units sold on TV, newspaper, and radio advertising budgets in the
Advertising data. Other information about this model was displayed in Table 3.4.

The F-statistic for the multiple linear regression model obtained by
regressing sales onto radio, TV, and newspaper is shown in Table 3.6. In this
example the F-statistic is 570. Since this is far larger than 1, it provides
compelling evidence against the null hypothesis \(H_0\). In other words, the
large F-statistic suggests that at least one of the advertising media must
be related to sales. However, what if the F-statistic had been closer to
1? How large does the F-statistic need to be before we can reject \(H_0\) and
conclude that there is a relationship? It turns out that the answer depends
on the values of \(n\) and \(p\). When \(n\) is large, an F-statistic that is just a
little larger than 1 might still provide evidence against \(H_0\). In contrast,
a larger F-statistic is needed to reject \(H_0\) if \(n\) is small. When \(H_0\) is true
and the errors \(\epsilon_i\) have a normal distribution, the F-statistic follows an
F-distribution.⁶ For any given value of \(n\) and \(p\), any statistical software
package can be used to compute the p-value associated with the F-statistic
using this distribution. Based on this p-value, we can determine whether
or not to reject \(H_0\). For the advertising data, the p-value associated with
the F-statistic in Table 3.6 is essentially zero, so we have extremely strong
evidence that at least one of the media is associated with increased sales.
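As a quick consistency check, the F-statistic can be recovered from the reported \(R^2\). Dividing the numerator and denominator of (3.23) by TSS and using \(R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}\) gives \(F = (R^2/p) / ((1 - R^2)/(n - p - 1))\). The sketch below applies this with the \(R^2\) of 0.8972 quoted later in this section for the three-predictor model, and assumes n = 200 markets in the Advertising data, a figure that is not restated here.

```python
def f_from_r2(r2, n, p):
    """Overall F-statistic (3.23), rewritten in terms of R^2 = 1 - RSS/TSS."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# n = 200 is an assumption about the Advertising data; p = 3 predictors.
f_stat = f_from_r2(r2=0.8972, n=200, p=3)
```

The result is about 570, in agreement with the value in Table 3.6 up to rounding.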

6 Even if the errors are not normally-distributed, the F-statistic approximately follows
an F-distribution provided that the sample size n is large.

In (3.23) we are testing \(H_0\) that all the coefficients are zero. Sometimes
we want to test that a particular subset of \(q\) of the coefficients are zero.
This corresponds to a null hypothesis

\[ H_0 : \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0, \]

where for convenience we have put the variables chosen for omission at the
end of the list. In this case we fit a second model that uses all the variables
except those last \(q\). Suppose that the residual sum of squares for that model
is \(\mathrm{RSS}_0\). Then the appropriate F-statistic is

\[ F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}. \tag{3.24} \]
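The two F-statistics are closely related, which a short sketch can make concrete. The functions below simply transcribe (3.23) and (3.24); the numbers are hypothetical and serve only to show that testing all p coefficients is the special case q = p, where the reduced model contains only an intercept and so its residual sum of squares equals TSS.

```python
def partial_f(rss0, rss, q, n, p):
    """F-statistic (3.24) for testing that q of the p coefficients are zero."""
    return ((rss0 - rss) / q) / (rss / (n - p - 1))


def overall_f(tss, rss, n, p):
    """F-statistic (3.23) for testing that all p coefficients are zero."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Hypothetical values, for illustration only.
tss, rss, n, p = 100.0, 40.0, 30, 3
f_all = overall_f(tss, rss, n, p)
# Dropping all p predictors leaves the intercept-only model, whose RSS is TSS.
f_all_via_partial = partial_f(rss0=tss, rss=rss, q=p, n=n, p=p)
```

Both calls return the same value, (60/3)/(40/26) = 13.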


Notice that in Table 3.4, for each individual predictor a t-statistic and 
a p-value were reported. These provide information about whether each 
individual predictor is related to the response, after adjusting for the other 
predictors. It turns out that each of these is exactly equivalent⁷ to the
F-test that omits that single variable from the model, leaving all the others
in, i.e. \(q = 1\) in (3.24). So it reports the partial effect of adding that variable
to the model. For instance, as we discussed earlier, these p-values indicate 
that TV and radio are related to sales, but that there is no evidence that 
newspaper is associated with sales, in the presence of these two. 

Given these individual p-values for each variable, why do we need to look 
at the overall F-statistic? After all, it seems likely that if any one of the 
p-values for the individual variables is very small, then at least one of the 
predictors is related to the response. However, this logic is flawed, especially 
when the number of predictors p is large. 

For instance, consider an example in which \(p = 100\) and
\(H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0\) is true, so no variable is truly associated with the response. In
this situation, about 5% of the p-values associated with each variable (of
the type shown in Table 3.4) will be below 0.05 by chance. In other words,
we expect to see approximately five small p-values even in the absence of
any true association between the predictors and the response. In fact, we
are almost guaranteed that we will observe at least one p-value below 0.05
by chance! Hence, if we use the individual t-statistics and associated
p-values in order to decide whether or not there is any association between
the variables and the response, there is a very high chance that we will
incorrectly conclude that there is a relationship. However, the F-statistic
does not suffer from this problem because it adjusts for the number of
predictors. Hence, if \(H_0\) is true, there is only a 5% chance that the
F-statistic will result in a p-value below 0.05, regardless of the number of
predictors or the number of observations.
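The arithmetic behind "almost guaranteed" is short. The sketch below assumes, as a simplification not made in the text, that the 100 individual tests are independent; under that assumption the chance of seeing at least one p-value below 0.05 when the null hypothesis is true is one minus the chance that all 100 tests stay above the threshold.

```python
alpha = 0.05   # significance threshold for each individual test
p = 100        # number of predictors, all truly unrelated to the response

# Expected number of p-values below alpha purely by chance.
expected_false_positives = alpha * p

# Probability of at least one such p-value, assuming independent tests.
prob_at_least_one = 1.0 - (1.0 - alpha) ** p
```

Here the expected count is 5 and the probability of at least one spurious finding is above 0.99. In practice the tests are correlated, so the exact probability differs, but the qualitative conclusion is the same.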

The approach of using an F-statistic to test for any association between
the predictors and the response works when \(p\) is relatively small, and
certainly small compared to \(n\). However, sometimes we have a very large
number of variables. If \(p > n\) then there are more coefficients \(\beta_j\) to estimate
than observations from which to estimate them. In this case we cannot
even fit the multiple linear regression model using least squares, so the
F-statistic cannot be used, and neither can most of the other concepts that
we have seen so far in this chapter. When \(p\) is large, some of the approaches
discussed in the next section, such as forward selection, can be used. This
high-dimensional setting is discussed in greater detail in Chapter 6.

7 The square of each t-statistic is the corresponding F-statistic.


Two: Deciding on Important Variables 

As discussed in the previous section, the first step in a multiple regression 
analysis is to compute the F-statistic and to examine the associated p- 
value. If we conclude on the basis of that p-value that at least one of the 
predictors is related to the response, then it is natural to wonder which are 
the guilty ones! We could look at the individual p-values as in Table 3.4, 
but as discussed, if p is large we are likely to make some false discoveries. 

It is possible that all of the predictors are associated with the response, 
but it is more often the case that the response is only related to a subset of 
the predictors. The task of determining which predictors are associated with 
the response, in order to fit a single model involving only those predictors, 
is referred to as variable selection. The variable selection problem is studied 
extensively in Chapter 6, and so here we will provide only a brief outline 
of some classical approaches. 

Ideally, we would like to perform variable selection by trying out a lot of
different models, each containing a different subset of the predictors. For
instance, if \(p = 2\), then we can consider four models: (1) a model
containing no variables, (2) a model containing \(X_1\) only, (3) a model containing
\(X_2\) only, and (4) a model containing both \(X_1\) and \(X_2\). We can then
select the best model out of all of the models that we have considered. How
do we determine which model is best? Various statistics can be used to
judge the quality of a model. These include Mallows' \(C_p\), Akaike
information criterion (AIC), Bayesian information criterion (BIC), and adjusted
\(R^2\). These are discussed in more detail in Chapter 6. We can also
determine which model is best by plotting various model outputs, such as the
residuals, in order to search for patterns.

Unfortunately, there are a total of \(2^p\) models that contain subsets of \(p\)
variables. This means that even for moderate \(p\), trying out every possible
subset of the predictors is infeasible. For instance, we saw that if \(p = 2\), then
there are \(2^2 = 4\) models to consider. But if \(p = 30\), then we must consider
\(2^{30} = 1{,}073{,}741{,}824\) models! This is not practical. Therefore, unless \(p\) is very
small, we cannot consider all \(2^p\) models, and instead we need an automated
and efficient approach to choose a smaller set of models to consider. There
are three classical approaches for this task:

• Forward selection. We begin with the null model, a model that
contains an intercept but no predictors. We then fit \(p\) simple linear
regressions and add to the null model the variable that results in the
lowest RSS. We then add to that model the variable that results
in the lowest RSS for the new two-variable model. This approach is
continued until some stopping rule is satisfied.

• Backward selection. We start with all variables in the model, and
remove the variable with the largest p-value, that is, the variable
that is the least statistically significant. The new \((p - 1)\)-variable
model is fit, and the variable with the largest p-value is removed. This
procedure continues until a stopping rule is reached. For instance, we
may stop when all remaining variables have a p-value below some
threshold.

• Mixed selection. This is a combination of forward and backward
selection. We start with no variables in the model, and as with forward
selection, we add the variable that provides the best fit. We
continue to add variables one-by-one. Of course, as we noted with the
Advertising example, the p-values for variables can become larger as
new predictors are added to the model. Hence, if at any point the
p-value for one of the variables in the model rises above a certain
threshold, then we remove that variable from the model. We
continue to perform these forward and backward steps until all variables
in the model have a sufficiently low p-value, and all variables outside
the model would have a large p-value if added to the model.

Backward selection cannot be used if p > n, while forward selection can 
always be used. Forward selection is a greedy approach, and might include 
variables early that later become redundant. Mixed selection can remedy 
this. 
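To see why a stepwise search is so much cheaper than trying every subset, one can count the model fits involved. The sketch below compares the \(2^p\) subsets with the number of fits forward selection performs if it is run until all p variables have been added: the null model, then p candidates at the first step, p - 1 at the second, and so on. The function names are hypothetical, and a stopping rule would reduce the forward selection count further.

```python
def all_subsets_count(p):
    """Number of models containing some subset of p predictors."""
    return 2 ** p


def forward_selection_count(p):
    """Fits made by forward selection run to completion:
    the null model plus p + (p - 1) + ... + 1 candidate models."""
    return 1 + p * (p + 1) // 2


p = 30
exhaustive = all_subsets_count(p)
stepwise = forward_selection_count(p)
```

For p = 30 this is 1,073,741,824 models against 466, which is the price paid for the greedy search: forward selection is not guaranteed to find the best subset.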

Three: Model Fit 

Two of the most common numerical measures of model fit are the RSE and
\(R^2\), the fraction of variance explained. These quantities are computed and
interpreted in the same fashion as for simple linear regression.

Recall that in simple regression, \(R^2\) is the square of the correlation of the
response and the variable. In multiple linear regression, it turns out that it
equals \(\mathrm{Cor}(Y, \hat{Y})^2\), the square of the correlation between the response and
the fitted linear model; in fact one property of the fitted linear model is
that it maximizes this correlation among all possible linear models.

An \(R^2\) value close to 1 indicates that the model explains a large portion
of the variance in the response variable. As an example, we saw in Table 3.6
that for the Advertising data, the model that uses all three advertising
media to predict sales has an \(R^2\) of 0.8972. On the other hand, the model that
uses only TV and radio to predict sales has an \(R^2\) value of 0.89719. In other
words, there is a small increase in \(R^2\) if we include newspaper advertising
in the model that already contains TV and radio advertising, even though
we saw earlier that the p-value for newspaper advertising in Table 3.4 is not
significant. It turns out that \(R^2\) will always increase when more variables
are added to the model, even if those variables are only weakly associated
with the response. This is due to the fact that adding another variable to
the least squares equations must allow us to fit the training data (though
not necessarily the testing data) more accurately. Thus, the \(R^2\) statistic,
which is also computed on the training data, must increase. The fact that
adding newspaper advertising to the model containing only TV and radio
advertising leads to just a tiny increase in \(R^2\) provides additional evidence
that newspaper can be dropped from the model. Essentially, newspaper
provides no real improvement in the model fit to the training samples, and its
inclusion will likely lead to poor results on independent test samples due
to overfitting.

In contrast, the model containing only TV as a predictor had an \(R^2\) of 0.61
(Table 3.2). Adding radio to the model leads to a substantial improvement
in \(R^2\). This implies that a model that uses TV and radio expenditures to
predict sales is substantially better than one that uses only TV
advertising. We could further quantify this improvement by looking at the p-value
for the radio coefficient in a model that contains only TV and radio as
predictors.

The model that contains only TV and radio as predictors has an RSE 
of 1.681, and the model that also contains newspaper as a predictor has 
an RSE of 1.686 (Table 3.6). In contrast, the model that contains only TV 
has an RSE of 3.26 (Table 3.2). This corroborates our previous conclusion 
that a model that uses TV and radio expenditures to predict sales is much 
more accurate (on the training data) than one that only uses TV spending. 
Furthermore, given that TV and radio expenditures are used as predictors, 
there is no point in also using newspaper spending as a predictor in the 
model. The observant reader may wonder how RSE can increase when 
newspaper is added to the model given that RSS must decrease. In general 
RSE is defined as 

\[ \mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}, \tag{3.25} \]

which simplifies to (3.15) for a simple linear regression. Thus, models with 
more variables can have higher RSE if the decrease in RSS is small relative 
to the increase in p. 
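The effect can be checked from the two \(R^2\) values quoted above. Since \(\mathrm{RSS} = \mathrm{TSS}\,(1 - R^2)\) and both models share the same TSS, the ratio of the two RSE values in (3.25) depends only on the \(R^2\) values and the degrees of freedom. The sketch below assumes n = 200 markets in the Advertising data, which is not restated in this section.

```python
import math

n = 200                  # assumption: number of markets in the Advertising data
r2_tv_radio = 0.89719    # model with TV and radio (p = 2)
r2_all_three = 0.8972    # model with TV, radio and newspaper (p = 3)

# RSS = TSS * (1 - R^2), so the shared TSS cancels in the ratio.
rss_ratio = (1 - r2_all_three) / (1 - r2_tv_radio)   # below 1: RSS falls slightly
df_ratio = (n - 2 - 1) / (n - 3 - 1)                 # above 1: one fewer degree of freedom

# RSE of the three-predictor model divided by RSE of the two-predictor model.
rse_ratio = math.sqrt(rss_ratio * df_ratio)
```

The ratio comes out just above 1 (an increase of roughly a quarter of a percent), which is in line with the reported values of 1.681 and 1.686 given the rounding of the inputs.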

In addition to looking at the RSE and \(R^2\) statistics just discussed, it
can be useful to plot the data. Graphical summaries can reveal problems
with a model that are not visible from numerical statistics. For example,
Figure 3.5 displays a three-dimensional plot of TV and radio versus sales.
We see that some observations lie above and some observations lie below
the least squares regression plane. In particular, the linear model seems to
overestimate sales for instances in which most of the advertising money
was spent exclusively on either TV or radio. It underestimates sales for
instances where the budget was split between the two media. This
pronounced non-linear pattern cannot be modeled accurately using linear
regression. It suggests a synergy or interaction effect between the advertising
media, whereby combining the media together results in a bigger boost to
sales than using any single medium. In Section 3.3.2, we will discuss
extending the linear model to accommodate such synergistic effects through
the use of interaction terms.

FIGURE 3.5. For the Advertising data, a linear regression fit to sales using
TV and radio as predictors. From the pattern of the residuals, we can see that
there is a pronounced non-linear relationship in the data. The positive residuals
(those visible above the surface) tend to lie along the 45-degree line, where TV
and radio budgets are split evenly. The negative residuals (most not visible) tend
to lie away from this line, where budgets are more lopsided.

Four: Predictions 

Once we have fit the multiple regression model, it is straightforward to
apply (3.21) in order to predict the response \(Y\) on the basis of a set of
values for the predictors \(X_1, X_2, \ldots, X_p\). However, there are three sorts of
uncertainty associated with this prediction.

1. The coefficient estimates \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p\) are estimates for \(\beta_0, \beta_1, \ldots, \beta_p\).
That is, the least squares plane

\[ \hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p \]

is only an estimate for the true population regression plane

\[ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p. \]

The inaccuracy in the coefficient estimates is related to the reducible
error from Chapter 2. We can compute a confidence interval in order
to determine how close \(\hat{Y}\) will be to \(f(X)\).

2. Of course, in practice assuming a linear model for \(f(X)\) is almost
always an approximation of reality, so there is an additional source of
potentially reducible error which we call model bias. So when we use a
linear model, we are in fact estimating the best linear approximation
to the true surface. However, here we will ignore this discrepancy,
and operate as if the linear model were correct.

3. Even if we knew \(f(X)\) (that is, even if we knew the true values
for \(\beta_0, \beta_1, \ldots, \beta_p\)), the response value cannot be predicted perfectly
because of the random error \(\epsilon\) in the model (3.21). In Chapter 2, we
referred to this as the irreducible error. How much will \(Y\) vary from
\(\hat{Y}\)? We use prediction intervals to answer this question. Prediction
intervals are always wider than confidence intervals, because they
incorporate both the error in the estimate for \(f(X)\) (the reducible
error) and the uncertainty as to how much an individual point will
differ from the population regression plane (the irreducible error).

We use a confidence interval to quantify the uncertainty surrounding
the average sales over a large number of cities. For example, given that
$100,000 is spent on TV advertising and $20,000 is spent on radio advertising
in each city, the 95% confidence interval is [10,985, 11,528]. We interpret
this to mean that 95% of intervals of this form will contain the true value of
\(f(X)\).⁸ On the other hand, a prediction interval can be used to quantify the
uncertainty surrounding sales for a particular city. Given that $100,000 is
spent on TV advertising and $20,000 is spent on radio advertising in that city,
the 95% prediction interval is [7,930, 14,580]. We interpret this to mean
that 95% of intervals of this form will contain the true value of \(Y\) for this
city. Note that both intervals are centered at 11,256, but that the prediction
interval is substantially wider than the confidence interval, reflecting the
increased uncertainty about sales for a given city in comparison to the
average sales over many locations.
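The relationship between the two intervals can be checked numerically. The sketch below assumes the standard formulas for linear regression, in which the confidence interval is the fitted value plus or minus t times the standard error of the fit, and the prediction interval replaces that standard error with the square root of RSE squared plus the squared standard error of the fit. The critical value 1.972 is an assumed approximation to the 97.5% t quantile with 197 degrees of freedom, and the RSE of 1.681 is the value quoted earlier for the model with TV and radio. All quantities are in thousands of units.

```python
import math

t_crit = 1.972                      # assumption: approx. 97.5% t quantile, 197 d.f.
rse = 1.681                         # RSE of the TV + radio model
ci_low, ci_high = 10.985, 11.528    # reported 95% confidence interval

center = (ci_low + ci_high) / 2
ci_half = (ci_high - ci_low) / 2
se_fit = ci_half / t_crit           # implied standard error of the fitted mean

# The prediction interval adds the irreducible error variance (RSE^2).
pi_half = t_crit * math.sqrt(rse ** 2 + se_fit ** 2)
pi_low, pi_high = center - pi_half, center + pi_half
```

The reconstructed prediction interval is approximately [7.93, 14.58], matching the reported [7,930, 14,580] units. Almost all of its width comes from the irreducible error term rather than from uncertainty in the coefficients.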


3.3 Other Considerations in the Regression Model 

3.3.1 Qualitative Predictors 

In our discussion so far, we have assumed that all variables in our linear 
regression model are quantitative. But in practice, this is not necessarily 
the case; often some predictors are qualitative. 


8 In other words, if we collect a large number of data sets like the Advertising data
set, and we construct a confidence interval for the average sales on the basis of each
data set (given $100,000 in TV and $20,000 in radio advertising), then 95% of these
confidence intervals will contain the true value of average sales.

For example, the Credit data set displayed in Figure 3.6 records balance
(average credit card debt for a number of individuals) as well as several
quantitative predictors: age, cards (number of credit cards), education
(years of education), income (in thousands of dollars), limit (credit limit),
and rating (credit rating). Each panel of Figure 3.6 is a scatterplot for a
pair of variables whose identities are given by the corresponding row and
column labels. For example, the scatterplot directly to the right of the word
“Balance” depicts balance versus age, while the plot directly to the right
of “Age” corresponds to age versus cards. In addition to these quantitative
variables, we also have four qualitative variables: gender, student (student
status), status (marital status), and ethnicity (Caucasian, African American
or Asian).


FIGURE 3.6. The Credit data set contains information about balance, age,
cards, education, income, limit, and rating for a number of potential
customers.

                  Coefficient   Std. error   t-statistic   p-value
Intercept              509.80        33.13        15.389   < 0.0001
gender[Female]          19.73        46.05         0.429     0.6690

TABLE 3.7. Least squares coefficient estimates associated with the regression of 
balance onto gender in the Credit data set. The linear model is given in (3.27). 
That is, gender is encoded as a dummy variable, as in (3.26). 


Predictors with Only Two Levels 

Suppose that we wish to investigate differences in credit card balance
between males and females, ignoring the other variables for the moment. If a
qualitative predictor (also known as a factor) only has two levels, or
possible values, then incorporating it into a regression model is very simple. We
simply create an indicator or dummy variable that takes on two possible
numerical values. For example, based on the gender variable, we can create
a new variable that takes the form

\[ x_i = \begin{cases} 1 & \text{if ith person is female} \\ 0 & \text{if ith person is male,} \end{cases} \tag{3.26} \]


and use this variable as a predictor in the regression equation. This results
in the model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if ith person is female} \\ \beta_0 + \epsilon_i & \text{if ith person is male.} \end{cases} \tag{3.27} \]

Now \(\beta_0\) can be interpreted as the average credit card balance among males,
\(\beta_0 + \beta_1\) as the average credit card balance among females, and \(\beta_1\) as the
average difference in credit card balance between females and males.

Table 3.7 displays the coefficient estimates and other information
associated with the model (3.27). The average credit card debt for males is
estimated to be $509.80, whereas females are estimated to carry $19.73 in 
additional debt for a total of $509.80 + $19.73 = $529.53. However, we 
notice that the p-value for the dummy variable is very high. This indicates 
that there is no statistical evidence of a difference in average credit card 
balance between the genders. 

The decision to code females as 1 and males as 0 in (3.27) is arbitrary, and
has no effect on the regression fit, but does alter the interpretation of the
coefficients. If we had coded males as 1 and females as 0, then the estimates
for \(\beta_0\) and \(\beta_1\) would have been 529.53 and −19.73, respectively, leading once
again to a prediction of credit card debt of $529.53 − $19.73 = $509.80 for
males and a prediction of $529.53 for females. Alternatively, instead of a
0/1 coding scheme, we could create a dummy variable

\[ x_i = \begin{cases} 1 & \text{if ith person is female} \\ -1 & \text{if ith person is male} \end{cases} \]

and use this variable in the regression equation. This results in the model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if ith person is female} \\ \beta_0 - \beta_1 + \epsilon_i & \text{if ith person is male.} \end{cases} \]

Now \(\beta_0\) can be interpreted as the overall average credit card balance
(ignoring the gender effect), and \(\beta_1\) is the amount that females are above the
average and males are below the average. In this example, the estimate for
\(\beta_0\) would be $519.665, halfway between the male and female averages of
$509.80 and $529.53. The estimate for \(\beta_1\) would be $9.865, which is half of
$19.73, the average difference between females and males. It is important to
note that the final predictions for the credit balances of males and females
will be identical regardless of the coding scheme used. The only difference
is in the way that the coefficients are interpreted.
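The equivalence of the three coding schemes can be verified directly from the two group averages quoted above. The sketch below relies on the fact that, with a single two-level predictor, the least squares fit reproduces the two group means, so the coefficients under each coding follow by simple algebra. The variable and function names are hypothetical.

```python
mean_male, mean_female = 509.80, 529.53   # estimated group averages from the text

# Coding A: female = 1, male = 0, as in (3.26).
a0, a1 = mean_male, mean_female - mean_male
# Coding B: male = 1, female = 0.
b0, b1 = mean_female, mean_male - mean_female
# Coding C: female = +1, male = -1.
c0, c1 = (mean_female + mean_male) / 2, (mean_female - mean_male) / 2


def predict(intercept, slope, x):
    """Predicted balance for dummy value x under a given coding."""
    return intercept + slope * x


pred_female = [predict(a0, a1, 1), predict(b0, b1, 0), predict(c0, c1, 1)]
pred_male = [predict(a0, a1, 0), predict(b0, b1, 1), predict(c0, c1, -1)]
```

All three entries of `pred_female` equal 529.53 and all three entries of `pred_male` equal 509.80, while the coefficient pairs differ: (509.80, 19.73), (529.53, −19.73) and (519.665, 9.865).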


Qualitative Predictors with More than Two Levels 

When a qualitative predictor has more than two levels, a single dummy
variable cannot represent all possible values. In this situation, we can create
additional dummy variables. For example, for the ethnicity variable we
create two dummy variables. The first could be

\[ x_{i1} = \begin{cases} 1 & \text{if ith person is Asian} \\ 0 & \text{if ith person is not Asian,} \end{cases} \tag{3.28} \]

and the second could be

\[ x_{i2} = \begin{cases} 1 & \text{if ith person is Caucasian} \\ 0 & \text{if ith person is not Caucasian.} \end{cases} \tag{3.29} \]


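Then both of these variables can be used in the regression equation, in
order to obtain the model

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if ith person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if ith person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if ith person is African American.} \end{cases} \tag{3.30} \]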

Now \(\beta_0\) can be interpreted as the average credit card balance for African
Americans, \(\beta_1\) can be interpreted as the difference in the average balance
between the Asian and African American categories, and \(\beta_2\) can be
interpreted as the difference in the average balance between the Caucasian and
African American categories. There will always be one fewer dummy
variable than the number of levels. The level with no dummy variable (African
American in this example) is known as the baseline.

                        Coefficient   Std. error   t-statistic   p-value
Intercept                    531.00        46.32        11.464   < 0.0001
ethnicity[Asian]             -18.69        65.02        -0.287     0.7740
ethnicity[Caucasian]         -12.50        56.68        -0.221     0.8260

TABLE 3.8. Least squares coefficient estimates associated with the regression
of balance onto ethnicity in the Credit data set. The linear model is given in
(3.30). That is, ethnicity is encoded via two dummy variables (3.28) and (3.29).

From Table 3.8, we see that the estimated balance for the baseline, 
African American, is $531.00. It is estimated that the Asian category will 
have $18.69 less debt than the African American category, and that the 
Caucasian category will have $12.50 less debt than the African American 
category. However, the p-values associated with the coefficient estimates for 
the two dummy variables are very large, suggesting no statistical evidence 
of a real difference in credit card balance between the ethnicities. Once 
again, the level selected as the baseline category is arbitrary, and the final 
predictions for each group will be the same regardless of this choice.
However, the coefficients and their p-values do depend on the choice of dummy
variable coding. Rather than rely on the individual coefficients, we can use
an F-test to test \(H_0: \beta_1 = \beta_2 = 0\); this does not depend on the coding.

This F-test has a p-value of 0.96, indicating that we cannot reject the null 
hypothesis that there is no relationship between balance and ethnicity. 
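
The claim that the fitted values do not depend on the baseline, while the
coefficients do, can be checked numerically. The following is a minimal sketch
in Python rather than a definitive implementation: the balances are made-up
numbers (not the Credit data), and the helper names `solve`, `least_squares`,
and `fit_with_baseline` are hypothetical, introduced only for this illustration.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def least_squares(X, y):
    """Least squares coefficients obtained from the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def fit_with_baseline(groups, y, baseline):
    """Dummy-code `groups` with the given baseline level and fit by least squares."""
    levels = [g for g in sorted(set(groups)) if g != baseline]
    X = [[1.0] + [1.0 if g == lv else 0.0 for lv in levels] for g in groups]
    beta = least_squares(X, y)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return dict(zip(["Intercept"] + levels, beta)), fitted


# Made-up balances (NOT the Credit data), chosen so the group means are round.
groups = ["Asian", "Asian", "Caucasian", "Caucasian", "Caucasian",
          "African American", "African American"]
balance = [500.0, 520.0, 510.0, 530.0, 520.0, 540.0, 560.0]

coef_aa, fitted_aa = fit_with_baseline(groups, balance, "African American")
coef_as, fitted_as = fit_with_baseline(groups, balance, "Asian")

print(coef_aa)  # intercept is the baseline (African American) mean
print(coef_as)  # intercept is now the Asian mean
print(max(abs(a - b) for a, b in zip(fitted_aa, fitted_as)))  # essentially zero
```

With either baseline, the intercept is the mean of the baseline group and each
remaining coefficient is a difference in group means, so the coefficients
change with the coding while the fitted values agree.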

Using this dummy variable approach presents no difficulties when
incorporating both quantitative and qualitative predictors. For example, to
regress balance on both a quantitative variable such as income and a
qualitative variable such as student, we must simply create a dummy variable
for student and then fit a multiple regression model using income and the
dummy variable as predictors for credit card balance.

There are many different ways of coding qualitative variables besides 
the dummy variable approach taken here. All of these approaches lead to 
equivalent model fits, but the coefficients are different and have different 
interpretations, and are designed to measure particular contrasts. This topic 
is beyond the scope of the book, and so we will not pursue it further. 


3.3.2 Extensions of the Linear Model 

The standard linear regression model (3.19) provides interpretable results
and works quite well on many real-world problems. However, it makes several
highly restrictive assumptions that are often violated in practice. Two
of the most important assumptions state that the relationship between the
predictors and response is additive and linear. The additive assumption
means that the effect of changes in a predictor \(X_j\) on the response \(Y\) is
independent of the values of the other predictors. The linear assumption
states that the change in the response \(Y\) due to a one-unit change in \(X_j\)
is constant, regardless of the value of \(X_j\). In this book, we examine a
number of sophisticated methods that relax these two assumptions. Here, we
briefly examine some common classical approaches for extending the linear model.

Removing the Additive Assumption 

In our previous analysis of the Advertising data, we concluded that both TV 
and radio seem to be associated with sales. The linear models that formed 
the basis for this conclusion assumed that the effect on sales of increasing 
one advertising medium is independent of the amount spent on the other 
media. For example, the linear model (3.20) states that the average effect
on sales of a one-unit increase in TV is always \(\beta_1\), regardless of the
amount spent on radio.

However, this simple model may be incorrect. Suppose that spending
money on radio advertising actually increases the effectiveness of TV
advertising, so that the slope term for TV should increase as radio increases.
In this situation, given a fixed budget of $100,000, spending half on radio
and half on TV may increase sales more than allocating the entire amount
to either TV or to radio. In marketing, this is known as a synergy effect,
and in statistics it is referred to as an interaction effect. Figure 3.5
suggests that such an effect may be present in the advertising data. Notice
that when levels of either TV or radio are low, then the true sales are lower
than predicted by the linear model. But when advertising is split between
the two media, then the model tends to underestimate sales.

Consider the standard linear regression model with two variables,

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.
\]

According to this model, if we increase \(X_1\) by one unit, then \(Y\) will
increase by an average of \(\beta_1\) units. Notice that the presence of \(X_2\)
does not alter this statement; that is, regardless of the value of \(X_2\), a
one-unit increase in \(X_1\) will lead to a \(\beta_1\)-unit increase in \(Y\).
One way of extending this model to allow for interaction effects is to include
a third predictor, called an interaction term, which is constructed by
computing the product of \(X_1\) and \(X_2\). This results in the model

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon. \tag{3.31}
\]

How does inclusion of this interaction term relax the additive assumption?
Notice that (3.31) can be rewritten as

\[
\begin{aligned}
Y &= \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon \\
  &= \beta_0 + \tilde{\beta}_1 X_1 + \beta_2 X_2 + \epsilon
\end{aligned}
\tag{3.32}
\]

where \(\tilde{\beta}_1 = \beta_1 + \beta_3 X_2\). Since \(\tilde{\beta}_1\)
changes with \(X_2\), the effect of \(X_1\) on \(Y\) is no longer constant:
adjusting \(X_2\) will change the impact of \(X_1\) on \(Y\).

              Coefficient   Std. error   t-statistic    p-value
Intercept          6.7502        0.248         27.23   < 0.0001
TV                 0.0191        0.002         12.70   < 0.0001
radio              0.0289        0.009          3.24     0.0014
TV × radio         0.0011        0.000         20.73   < 0.0001

TABLE 3.9. For the Advertising data, least squares coefficient estimates
associated with the regression of sales onto TV and radio, with an interaction
term, as in (3.33).

For example, suppose that we are interested in studying the productivity
of a factory. We wish to predict the number of units produced on the
basis of the number of production lines and the total number of workers.
It seems likely that the effect of increasing the number of production lines
will depend on the number of workers, since if no workers are available
to operate the lines, then increasing the number of lines will not increase
production. This suggests that it would be appropriate to include an
interaction term between lines and workers in a linear model to predict units.
Suppose that when we fit the model, we obtain

\[
\begin{aligned}
\text{units} &\approx 1.2 + 3.4 \times \text{lines} + 0.22 \times \text{workers} + 1.4 \times (\text{lines} \times \text{workers}) \\
             &= 1.2 + (3.4 + 1.4 \times \text{workers}) \times \text{lines} + 0.22 \times \text{workers}.
\end{aligned}
\]

In other words, adding an additional line will increase the number of units
produced by 3.4 + 1.4 × workers. Hence the more workers we have, the
stronger will be the effect of lines.

We now return to the Advertising example. A linear model that uses
radio, TV, and an interaction between the two to predict sales takes the
form

\[
\begin{aligned}
\text{sales} &= \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{radio} \times \text{TV}) + \epsilon \\
             &= \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon.
\end{aligned}
\tag{3.33}
\]

We can interpret \(\beta_3\) as the increase in the effectiveness of TV
advertising for a one unit increase in radio advertising (or vice-versa). The
coefficients that result from fitting the model (3.33) are given in Table 3.9.

The results in Table 3.9 strongly suggest that the model that includes the
interaction term is superior to the model that contains only main effects.
The p-value for the interaction term, TV × radio, is extremely low, indicating
that there is strong evidence for \(H_a: \beta_3 \neq 0\). In other words, it is
clear that the true relationship is not additive. The \(R^2\) for the model
(3.33) is 96.8%, compared to only 89.7% for the model that predicts sales
using TV and radio without an interaction term. This means that
(96.8 - 89.7)/(100 - 89.7) = 69% of the variability in sales that remains
after fitting the additive model has been explained by the interaction term.
The coefficient estimates in Table 3.9 suggest that an increase in TV
advertising of $1,000 is associated with increased sales of
\((\hat{\beta}_1 + \hat{\beta}_3 \times \text{radio}) \times 1{,}000 = 19 + 1.1 \times \text{radio}\)
units. And an increase in radio advertising of $1,000 will be associated with
an increase in sales of
\((\hat{\beta}_2 + \hat{\beta}_3 \times \text{TV}) \times 1{,}000 = 29 + 1.1 \times \text{TV}\)
units.
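
The arithmetic in this paragraph is easy to reproduce. The sketch below, in
Python, plugs the estimates from Table 3.9 into the rewritten form of (3.33).
The function names are hypothetical, and we assume that the radio and TV
budgets are recorded in thousands of dollars, which is what the factor of
1,000 in the text implies.

```python
# Coefficient estimates copied from Table 3.9.
B_TV, B_RADIO, B_INTERACTION = 0.0191, 0.0289, 0.0011


def tv_effect(radio):
    """Extra units sold per additional $1,000 of TV at a given radio budget.

    `radio` is assumed to be in thousands of dollars.
    """
    return (B_TV + B_INTERACTION * radio) * 1000


def radio_effect(tv):
    """Extra units sold per additional $1,000 of radio at a given TV budget."""
    return (B_RADIO + B_INTERACTION * tv) * 1000


# The slope for TV is no longer constant: it grows with the radio budget.
for radio in (0, 10, 40):
    print(radio, round(tv_effect(radio), 1))

# Share of the variability left unexplained by the additive model (R^2 = 89.7%)
# that the interaction model (R^2 = 96.8%) accounts for.
share = (96.8 - 89.7) / (100 - 89.7)
print(round(100 * share))
```

The loop makes the interaction concrete: the estimated payoff from another
$1,000 of TV advertising rises by about 1.1 units for every additional $1,000
already being spent on radio.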

In this example, the p-values associated with TV, radio, and the interaction
term are all statistically significant (Table 3.9), and so it is obvious
that all three variables should be included in the model. However, it is
sometimes the case that an interaction term has a very small p-value, but
the associated main effects (in this case, TV and radio) do not. The
hierarchical principle states that if we include an interaction in a model, we
should also include the main effects, even if the p-values associated with
their coefficients are not significant. In other words, if the interaction
between \(X_1\) and \(X_2\) seems important, then we should include both \(X_1\)
and \(X_2\) in the model even if their coefficient estimates have large
p-values. The rationale for this principle is that if \(X_1 \times X_2\) is
related to the response, then whether or not the coefficients of \(X_1\) or
\(X_2\) are exactly zero is of little interest. Also \(X_1 \times X_2\) is
typically correlated with \(X_1\) and \(X_2\), and so leaving them out tends to
alter the meaning of the interaction.

In the previous example, we considered an interaction between TV and 
radio, both of which are quantitative variables. However, the concept of 
interactions applies just as well to qualitative variables, or to a combination 
of quantitative and qualitative variables. In fact, an interaction between 
a qualitative variable and a quantitative variable has a particularly nice 
interpretation. Consider the Credit data set from Section 3.3.1, and suppose 
that we wish to predict balance using the income (quantitative) and student 
(qualitative) variables. In the absence of an interaction term, the model 
takes the form 

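\[
\begin{aligned}
\text{balance}_i &\approx \beta_0 + \beta_1 \times \text{income}_i +
\begin{cases}
\beta_2 & \text{if $i$th person is a student} \\
0 & \text{if $i$th person is not a student}
\end{cases} \\
&= \beta_1 \times \text{income}_i +
\begin{cases}
\beta_0 + \beta_2 & \text{if $i$th person is a student} \\
\beta_0 & \text{if $i$th person is not a student.}
\end{cases}
\end{aligned}
\tag{3.34}
\]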

Notice that this amounts to fitting two parallel lines to the data, one for
students and one for non-students. The lines for students and non-students
have different intercepts, \(\beta_0 + \beta_2\) versus \(\beta_0\), but the
same slope, \(\beta_1\). This is illustrated in the left-hand panel of
Figure 3.7. The fact that the lines are parallel means that the average effect
on balance of a one-unit increase in income does not depend on whether or not
the individual is a student. This represents a potentially serious limitation
of the model, since in fact a change in income may have a very different
effect on the credit card balance of a student versus a non-student.

This limitation can be addressed by adding an interaction variable, created
by multiplying income with the dummy variable for student. Our model now
becomes

\[
\begin{aligned}
\text{balance}_i &\approx \beta_0 + \beta_1 \times \text{income}_i +
\begin{cases}
\beta_2 + \beta_3 \times \text{income}_i & \text{if student} \\
0 & \text{if not student}
\end{cases} \\
&=
\begin{cases}
(\beta_0 + \beta_2) + (\beta_1 + \beta_3) \times \text{income}_i & \text{if student} \\
\beta_0 + \beta_1 \times \text{income}_i & \text{if not student.}
\end{cases}
\end{aligned}
\tag{3.35}
\]

FIGURE 3.7. For the Credit data, the least squares lines are shown for
prediction of balance from income for students and non-students. Left: The
model (3.34) was fit. There is no interaction between income and student.
Right: The model (3.35) was fit. There is an interaction term between income
and student.

Once again, we have two different regression lines for the students and 
the non-students. But now those regression lines have different intercepts, 
\(\beta_0 + \beta_2\) versus \(\beta_0\), as well as different slopes, \(\beta_1 + \beta_3\) versus \(\beta_1\). This allows for
the possibility that changes in income may affect the credit card balances 
of students and non-students differently. The right-hand panel of Figure 3.7 
shows the estimated relationships between income and balance for students 
and non-students in the model (3.35). We note that the slope for students 
is lower than the slope for non-students. This suggests that increases in 
income are associated with smaller increases in credit card balance among 
students as compared to non-students. 


Non-linear Relationships 

As discussed previously, the linear regression model (3.19) assumes a linear 
relationship between the response and predictors. But in some cases, the 
true relationship between the response and the predictors may be non-linear.
Here we present a very simple way to directly extend the linear model
to accommodate non-linear relationships, using polynomial regression. In 
later chapters, we will present more complex approaches for performing 
non-linear fits in more general settings. 

FIGURE 3.8. The Auto data set. For a number of cars, mpg and horsepower are
shown. The linear regression fit is shown in orange. The linear regression fit
for a model that includes \(\text{horsepower}^2\) is shown as a blue curve. The
linear regression fit for a model that includes all polynomials of horsepower
up to fifth-degree is shown in green.

Consider Figure 3.8, in which the mpg (gas mileage in miles per gallon)
versus horsepower is shown for a number of cars in the Auto data set. The
orange line represents the linear regression fit. There is a pronounced
relationship between mpg and horsepower, but it seems clear that this
relationship is in fact non-linear: the data suggest a curved relationship. A
simple approach for incorporating non-linear associations in a linear model is
to include transformed versions of the predictors in the model. For example,
the points in Figure 3.8 seem to have a quadratic shape, suggesting that a
model of the form

\[
\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon \tag{3.36}
\]

may provide a better fit. Equation 3.36 involves predicting mpg using a
non-linear function of horsepower. But it is still a linear model! That is,
(3.36) is simply a multiple linear regression model with
\(X_1 = \text{horsepower}\) and \(X_2 = \text{horsepower}^2\). So we can use
standard linear regression software to estimate \(\beta_0\), \(\beta_1\), and
\(\beta_2\) in order to produce a non-linear fit. The blue curve
in Figure 3.8 shows the resulting quadratic fit to the data. The quadratic
fit appears to be substantially better than the fit obtained when just the
linear term is included. The \(R^2\) of the quadratic fit is 0.688, compared to
0.606 for the linear fit, and the p-value in Table 3.10 for the quadratic term
is highly significant.

                Coefficient   Std. error   t-statistic    p-value
Intercept           56.9001       1.8004          31.6   < 0.0001
horsepower          -0.4662       0.0311         -15.0   < 0.0001
horsepower^2         0.0012       0.0001          10.1   < 0.0001

TABLE 3.10. For the Auto data set, least squares coefficient estimates
associated with the regression of mpg onto horsepower and
\(\text{horsepower}^2\).

If including \(\text{horsepower}^2\) led to such a big improvement in the
model, why not include \(\text{horsepower}^3\), \(\text{horsepower}^4\), or even
\(\text{horsepower}^5\)? The green curve in Figure 3.8 displays the fit that
results from including all polynomials up to fifth degree in the model (3.36).
The resulting fit seems unnecessarily wiggly; that is, it is unclear that
including the additional terms really has led to a better fit to the data.

The approach that we have just described for extending the linear model
to accommodate non-linear relationships is known as polynomial regression,
since we have included polynomial functions of the predictors in the
regression model. We further explore this approach and other non-linear
extensions of the linear model in Chapter 7.
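
Because a polynomial model is linear in its coefficients, fitting it requires
nothing beyond ordinary least squares on the columns 1, x, x^2, and so on. The
following Python sketch illustrates this on a small made-up data set (not the
Auto data); the helper names `solve`, `fit_polynomial`, and `r_squared` are
hypothetical, and the normal equations are solved directly only because the
example is tiny.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_polynomial(x, y, degree):
    """Least squares fit of y on 1, x, ..., x**degree via the normal equations."""
    X = [[xi ** d for d in range(degree + 1)] for xi in x]
    p = degree + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def r_squared(x, y, beta):
    """R^2 = 1 - RSS/TSS for the polynomial with coefficients `beta`."""
    ybar = sum(y) / len(y)
    fitted = [sum(b * xi ** d for d, b in enumerate(beta)) for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1 - rss / tss


# Made-up curved data: a quadratic trend plus a small fixed perturbation.
x = [float(v) for v in range(1, 9)]
wiggle = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.3]
y = [40 - 6 * xi + 0.4 * xi ** 2 + w for xi, w in zip(x, wiggle)]

beta_lin = fit_polynomial(x, y, 1)
beta_quad = fit_polynomial(x, y, 2)
r2_lin = r_squared(x, y, beta_lin)
r2_quad = r_squared(x, y, beta_quad)
print(round(r2_lin, 3), round(r2_quad, 3))  # the quadratic fit has the higher R^2
```

Raising `degree` further can only increase the training R^2, which is one
reason a higher R^2 alone does not show that the extra terms are worthwhile.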


3.3.3 Potential Problems 

When we fit a linear regression model to a particular data set, many problems
may occur. Most common among these are the following:

1. Non-linearity of the response-predictor relationships. 

2. Correlation of error terms. 

3. Non-constant variance of error terms. 

4. Outliers. 

5. High-leverage points. 

6. Collinearity. 

In practice, identifying and overcoming these problems is as much an 
art as a science. Many pages in countless books have been written on this 
topic. Since the linear regression model is not our primary focus here, we 
will provide only a brief summary of some key points. 

1. Non-linearity of the Data 

The linear regression model assumes that there is a straight-line relationship
between the predictors and the response. If the true relationship is
far from linear, then virtually all of the conclusions that we draw from the 
fit are suspect. In addition, the prediction accuracy of the model can be 
significantly reduced. 

Residual plots are a useful graphical tool for identifying non-linearity.
Given a simple linear regression model, we can plot the residuals,
\(e_i = y_i - \hat{y}_i\), versus the predictor \(x_i\). In the case of a
multiple regression model, since there are multiple predictors, we instead
plot the residuals versus the predicted (or fitted) values \(\hat{y}_i\).
Ideally, the residual plot will show no discernible pattern. The presence of a
pattern may indicate a problem with some aspect of the linear model.

FIGURE 3.9. Plots of residuals versus predicted (or fitted) values for the Auto
data set. In each plot, the red line is a smooth fit to the residuals, intended
to make it easier to identify a trend. Left: A linear regression of mpg on
horsepower. A strong pattern in the residuals indicates non-linearity in the
data. Right: A linear regression of mpg on horsepower and
\(\text{horsepower}^2\). There is little pattern in the residuals.

The left panel of Figure 3.9 displays a residual plot from the linear 
regression of mpg onto horsepower on the Auto data set that was illustrated 
in Figure 3.8. The red line is a smooth fit to the residuals, which is displayed 
in order to make it easier to identify any trends. The residuals exhibit a 
clear U-shape, which provides a strong indication of non-linearity in the 
data. In contrast, the right-hand panel of Figure 3.9 displays the residual 
plot that results from the model (3.36), which contains a quadratic term. 
There appears to be little pattern in the residuals, suggesting that the 
quadratic term improves the fit to the data. 

If the residual plot indicates that there are non-linear associations in the 
data, then a simple approach is to use non-linear transformations of the 
predictors, such as \(\log X\), \(\sqrt{X}\), and \(X^2\), in the regression model. In the
later chapters of this book, we will discuss other more advanced non-linear 
approaches for addressing this issue. 
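
The U-shape described above can be seen without drawing a plot. In the Python
sketch below we fit a straight line to a made-up, noise-free curved response
(an assumption chosen to make the pattern unmistakable, not the Auto data) and
list the residuals in order of the fitted values; `fit_line` is a hypothetical
helper.

```python
def fit_line(x, y):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope


x = [float(v) for v in range(1, 9)]
y = [40 - 6 * xi + 0.4 * xi ** 2 for xi in x]  # made-up quadratic response

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residuals are positive at both ends and negative in the middle: a U-shape.
for fi, ei in zip(fitted, residuals):
    print(f"fitted={fi:6.2f}  residual={ei:+.2f}")
```

A run of residuals with the same sign, as here, is the numerical counterpart
of the pattern in the left-hand panel of Figure 3.9.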

2. Correlation of Error Terms 

An important assumption of the linear regression model is that the error
terms, \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\), are uncorrelated. What
does this mean? For instance, if the errors are uncorrelated, then the fact
that \(\epsilon_i\) is positive provides little or no information about the
sign of \(\epsilon_{i+1}\). The standard errors that are computed for the
estimated regression coefficients or the fitted values are based on the
assumption of uncorrelated error terms. If in fact there is correlation among
the error terms, then the estimated standard errors will tend to underestimate
the true standard errors. As a result, confidence and prediction intervals
will be narrower than they should be. For example, a 95% confidence interval
may in reality have a much lower probability than 0.95 of containing the true
value of the parameter. In addition, p-values associated with the model will
be lower than they should be; this could cause us to erroneously conclude that
a parameter is statistically significant. In short, if the error terms are
correlated, we may have an unwarranted sense of confidence in our model.

As an extreme example, suppose we accidentally doubled our data, leading
to observations and error terms identical in pairs. If we ignored this, our
standard error calculations would be as if we had a sample of size \(2n\), when
in fact we have only \(n\) samples. Our estimated parameters would be the
same for the \(2n\) samples as for the \(n\) samples, but the confidence
intervals would be narrower by a factor of \(\sqrt{2}\).
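
The doubling argument can be verified directly. The Python sketch below uses
made-up data with a fixed perturbation standing in for noise, and a
hypothetical helper `slope_and_se`. With the usual residual standard error,
which divides by n - 2, the ratio of standard errors is
sqrt((2n - 2)/(n - 2)), which approaches the factor of sqrt(2) quoted above as
n grows.

```python
import math


def slope_and_se(x, y):
    """Least squares slope and its standard error for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))  # residual standard error
    return slope, rse / math.sqrt(sxx)


n = 50
x = [float(i) for i in range(1, n + 1)]
# Made-up response: a line plus a deterministic wiggle standing in for noise.
y = [2.0 + 0.5 * xi + math.sin(7.0 * xi) for xi in x]

slope_n, se_n = slope_and_se(x, y)
slope_2n, se_2n = slope_and_se(x + x, y + y)  # every observation appears twice
ratio = se_n / se_2n

print(slope_n, slope_2n)     # identical estimates
print(ratio, math.sqrt(2))   # the standard error shrinks by roughly sqrt(2)
```

Nothing new was learned from the duplicated rows, yet the reported standard
error is smaller, which is exactly the unwarranted confidence described above.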

Why might correlations among the error terms occur? Such correlations
frequently occur in the context of time series data, which consists of
observations for which measurements are obtained at discrete points in time.
In many cases, observations that are obtained at adjacent time points will
have positively correlated errors. In order to determine if this is the case for
a given data set, we can plot the residuals from our model as a function of
time. If the errors are uncorrelated, then there should be no discernible
pattern. On the other hand, if the error terms are positively correlated, then
we may see tracking in the residuals; that is, adjacent residuals may have
similar values. Figure 3.10 provides an illustration. In the top panel, we see
the residuals from a linear regression fit to data generated with uncorrelated
errors. There is no evidence of a time-related trend in the residuals.
In contrast, the residuals in the bottom panel are from a data set in which
adjacent errors had a correlation of 0.9. Now there is a clear pattern in the
residuals: adjacent residuals tend to take on similar values. Finally, the
center panel illustrates a more moderate case in which adjacent errors had a
correlation of 0.5. There is still evidence of tracking, but the pattern is
less clear.
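
Tracking can also be summarized by a single number, the correlation between
each value and the next one in time. The Python sketch below is an
illustration under stated assumptions rather than the procedure used for
Figure 3.10: the errors are simulated from a first-order autoregressive
scheme of our own choosing, the statistic is applied to those errors directly
instead of to residuals from a fitted model, and both function names are
hypothetical.

```python
import math
import random


def lag1_autocorrelation(e):
    """Sample correlation between consecutive values e[t] and e[t + 1]."""
    n = len(e)
    ebar = sum(e) / n
    num = sum((e[t] - ebar) * (e[t + 1] - ebar) for t in range(n - 1))
    den = sum((et - ebar) ** 2 for et in e)
    return num / den


def simulate_errors(rho, n, seed):
    """Unit-variance errors in which adjacent terms have correlation rho."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        e.append(rho * e[-1] + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0))
    return e


results = {rho: lag1_autocorrelation(simulate_errors(rho, 2000, seed=1))
           for rho in (0.0, 0.5, 0.9)}
for rho, r in results.items():
    print(rho, round(r, 2))  # the estimate rises with the true correlation
```

A value near zero is what we expect from uncorrelated errors, while values
well above zero correspond to the tracking visible in the lower panels.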

Many methods have been developed to properly take account of correlations
in the error terms in time series data. Correlation among the error
terms can also occur outside of time series data. For instance, consider a
study in which individuals’ heights are predicted from their weights. The
assumption of uncorrelated errors could be violated if some of the individuals
in the study are members of the same family, or eat the same diet,
or have been exposed to the same environmental factors. In general, the
assumption of uncorrelated errors is extremely important for linear regression
as well as for other statistical methods, and good experimental design
is crucial in order to mitigate the risk of such correlations.

FIGURE 3.10. Plots of residuals from simulated time series data sets generated
with differing levels of correlation \(\rho\) between error terms for adjacent
time points.


3. Non-constant Variance of Error Terms 

Another important assumption of the linear regression model is that the
error terms have a constant variance, \(\mathrm{Var}(\epsilon_i) = \sigma^2\).
The standard errors, confidence intervals, and hypothesis tests associated
with the linear model rely upon this assumption.

Unfortunately, it is often the case that the variances of the error terms are
non-constant. For instance, the variances of the error terms may increase
with the value of the response. One can identify non-constant variances in
the errors, or heteroscedasticity, from the presence of a funnel shape in
the residual plot. An example is shown in the left-hand panel of Figure 3.11,
in which the magnitude of the residuals tends to increase with the fitted
values. When faced with this problem, one possible solution is to transform
the response \(Y\) using a concave function such as \(\log Y\) or
\(\sqrt{Y}\). Such a transformation results in a greater amount of shrinkage
of the larger responses, leading to a reduction in heteroscedasticity. The
right-hand panel of Figure 3.11 displays the residual plot after transforming
the response using \(\log Y\). The residuals now appear to have constant
variance, though there is some evidence of a slight non-linear relationship in
the data.

FIGURE 3.11. Residual plots. In each plot, the red line is a smooth fit to the
residuals, intended to make it easier to identify a trend. The blue lines track
the outer quantiles of the residuals, and emphasize patterns. Left: The funnel
shape indicates heteroscedasticity. Right: The response has been
log-transformed, and there is now no evidence of heteroscedasticity.

Sometimes we have a good idea of the variance of each response. For
example, the \(i\)th response could be an average of \(n_i\) raw observations.
If each of these raw observations is uncorrelated with variance \(\sigma^2\),
then their average has variance \(\sigma_i^2 = \sigma^2 / n_i\). In this case
a simple remedy is to fit our model by weighted least squares, with weights
proportional to the inverse variances, i.e. \(w_i = n_i\) in this case. Most
linear regression software allows for observation weights.
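
To see why the weights \(w_i = n_i\) are the natural choice, we can compare
weighted least squares on the averages with ordinary least squares on the raw
observations. The Python sketch below uses made-up raw data and a hypothetical
helper `weighted_line`; it is meant only to illustrate the equivalence for
simple linear regression.

```python
def weighted_line(x, y, w):
    """Weighted least squares for a straight line; returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    slope = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
             / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - slope * xbar, slope


# Made-up raw observations, grouped by predictor value; group sizes differ.
raw = {1.0: [2.1, 1.9, 2.3, 2.0], 2.0: [3.8], 3.0: [6.2, 5.9], 4.0: [8.4, 7.9, 8.0]}

x = list(raw)
means = [sum(obs) / len(obs) for obs in raw.values()]  # the averaged responses
n_i = [float(len(obs)) for obs in raw.values()]        # weights w_i = n_i

wls = weighted_line(x, means, n_i)
unweighted = weighted_line(x, means, [1.0] * len(x))

# Ordinary least squares on the raw data is the unit-weight special case.
x_raw = [xi for xi, obs in raw.items() for _ in obs]
y_raw = [yi for obs in raw.values() for yi in obs]
ols_raw = weighted_line(x_raw, y_raw, [1.0] * len(y_raw))

print(wls)         # matches the fit to the raw observations
print(ols_raw)
print(unweighted)  # ignores how many observations each average is based on
```

The weighted fit to the averages reproduces the fit to the raw data, whereas
the unweighted fit treats an average of four observations the same as a single
observation.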

4. Outliers 

An outlier is a point for which \(y_i\) is far from the value predicted by the
model. Outliers can arise for a variety of reasons, such as incorrect recording 
of an observation during data collection. 

The red point (observation 20) in the left-hand panel of Figure 3.12
illustrates a typical outlier. The red solid line is the least squares regression
fit, while the blue dashed line is the least squares fit after removal of the
outlier. In this case, removing the outlier has little effect on the least squares
line: it leads to almost no change in the slope, and a minuscule reduction
in the intercept. It is typical for an outlier that does not have an unusual
predictor value to have little effect on the least squares fit. However, even
if an outlier does not have much effect on the least squares fit, it can cause
other problems. For instance, in this example, the RSE is 1.09 when the
outlier is included in the regression, but it is only 0.77 when the outlier
is removed. Since the RSE is used to compute all confidence intervals and
p-values, such a dramatic increase caused by a single data point can have
implications for the interpretation of the fit. Similarly, inclusion of the
outlier causes the \(R^2\) to decline from 0.892 to 0.805.

FIGURE 3.12. Left: The least squares regression line is shown in red, and the
regression line after removing the outlier is shown in blue. Center: The residual
plot clearly identifies the outlier. Right: The outlier has a studentized residual of
6; typically we expect values between -3 and 3.

Residual plots can be used to identify outliers. In this example, the outlier
is clearly visible in the residual plot illustrated in the center panel of
Figure 3.12. But in practice, it can be difficult to decide how large a
residual needs to be before we consider the point to be an outlier. To address
this problem, instead of plotting the residuals, we can plot the studentized
residuals, computed by dividing each residual \(e_i\) by its estimated standard
error. Observations whose studentized residuals are greater than 3 in absolute
value are possible outliers. In the right-hand panel of Figure 3.12, the
outlier’s studentized residual exceeds 6, while all other observations have
studentized residuals between -2 and 2.
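
For simple linear regression, the estimated standard error of the residual
\(e_i\) is \(\text{RSE} \times \sqrt{1 - h_i}\), where \(h_i\) is the leverage
of the \(i\)th observation. The Python sketch below uses this form on made-up
data with one planted outlier; the function name is hypothetical, and some
software instead uses a leave-one-out estimate of the error variance, so its
values may differ somewhat from these.

```python
import math


def studentized_residuals(x, y):
    """Residuals divided by their estimated standard errors (simple regression)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    rse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return [e / (rse * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]


# Made-up data: points close to a line, with one response shifted upward.
x = [float(i + 1) for i in range(20)]
y = [1.0 + 2.0 * xi + 0.5 * (-1) ** i for i, xi in enumerate(x)]
y[9] += 10.0  # the planted outlier

t = studentized_residuals(x, y)
flagged = [i for i, ti in enumerate(t) if abs(ti) > 3]
print(flagged)          # indices whose studentized residual exceeds 3 in absolute value
print(round(t[9], 2))
```

Only the planted observation crosses the threshold of 3; the remaining
studentized residuals stay well inside the range from -2 to 2.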

If we believe that an outlier has occurred due to an error in data collection
or recording, then one solution is to simply remove the observation.
However, care should be taken, since an outlier may instead indicate a 
deficiency with the model, such as a missing predictor. 

5. High Leverage Points 

We just saw that outliers are observations for which the response \(y_i\) is
unusual given the predictor \(x_i\). In contrast, observations with high
leverage have an unusual value for \(x_i\). For example, observation 41 in the
left-hand panel of Figure 3.13 has high leverage, in that the predictor value
for this observation is large relative to the other observations. (Note that
the data displayed in Figure 3.13 are the same as the data displayed in
Figure 3.12, but with the addition of a single high leverage observation.) The
red solid line is the least squares fit to the data, while the blue dashed line
is the fit produced when observation 41 is removed. Comparing the left-hand
panels of Figures 3.12 and 3.13, we observe that removing the high leverage
observation has a much more substantial impact on the least squares line
than removing the outlier. In fact, high leverage observations tend to have
a sizable impact on the estimated regression line. It is cause for concern if
the least squares line is heavily affected by just a couple of observations,
because any problems with these points may invalidate the entire fit. For
this reason, it is important to identify high leverage observations.

FIGURE 3.13. Left: Observation 41 is a high leverage point, while 20 is not.
The red line is the fit to all the data, and the blue line is the fit with
observation 41 removed. Center: The red observation is not unusual in terms of
its \(X_1\) value or its \(X_2\) value, but still falls outside the bulk of the
data, and hence has high leverage. Right: Observation 41 has a high leverage
and a high residual.

In a simple linear regression, high leverage observations are fairly easy to 
identify, since we can simply look for observations for which the predictor 
value is outside of the normal range of the observations. But in a multiple 
linear regression with many predictors, it is possible to have an observation 
that is well within the range of each individual predictor’s values, but that 
is unusual in terms of the full set of predictors. An example is shown in 
the center panel of Figure 3.13, for a data set with two predictors, \(X_1\) and
\(X_2\). Most of the observations’ predictor values fall within the blue dashed
ellipse, but the red observation is well outside of this range. But neither its
value for \(X_1\) nor its value for \(X_2\) is unusual. So if we examine just \(X_1\) or
just \(X_2\), we will fail to notice this high leverage point. This problem is more
pronounced in multiple regression settings with more than two predictors, 
because then there is no simple way to plot all dimensions of the data 
simultaneously. 

In order to quantify an observation’s leverage, we compute the leverage
statistic. A large value of this statistic indicates an observation with high
leverage. For a simple linear regression,

\[
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}. \tag{3.37}
\]

It is clear from this equation that \(h_i\) increases with the distance of
\(x_i\) from \(\bar{x}\). There is a simple extension of \(h_i\) to the case of
multiple predictors, though we do not provide the formula here. The leverage
statistic \(h_i\) is always between \(1/n\) and 1, and the average leverage for
all the observations is always equal to \((p + 1)/n\). So if a given
observation has a leverage statistic that greatly exceeds \((p + 1)/n\), then
we may suspect that the corresponding point has high leverage.

FIGURE 3.14. Scatterplots of the observations from the Credit data set. Left:
A plot of age versus limit. These two variables are not collinear. Right: A plot
of rating versus limit. There is high collinearity.
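
The properties of the leverage statistic stated above can be checked from
(3.37) directly. The Python sketch below uses made-up predictor values in
which the last one lies far from the rest; `leverage` is a hypothetical helper
name.

```python
def leverage(x):
    """Leverage statistics h_i from (3.37) for a simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Made-up predictor values: ten ordinary points and one far from the others.
x = [float(v) for v in range(1, 11)] + [30.0]
h = leverage(x)

n, p = len(x), 1               # one predictor in simple linear regression
average = sum(h) / n
print(average, (p + 1) / n)    # the average leverage equals (p + 1)/n
print(round(h[-1], 3))         # the distant point far exceeds that average
```

Note that the response plays no role in this calculation: leverage depends
only on where the predictor values lie.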

The right-hand panel of Figure 3.13 provides a plot of the studentized
residuals versus \(h_i\) for the data in the left-hand panel of Figure 3.13.
Observation 41 stands out as having a very high leverage statistic as well as a
high studentized residual. In other words, it is an outlier as well as a high 
leverage observation. This is a particularly dangerous combination! This 
plot also reveals the reason that observation 20 had relatively little effect 
on the least squares fit in Figure 3.12: it has low leverage. 


6. Collinearity 

Collinearity refers to the situation in which two or more predictor variables
are closely related to one another. The concept of collinearity is illustrated
in Figure 3.14 using the Credit data set. In the left-hand panel of
Figure 3.14, the two predictors limit and age appear to have no obvious
relationship. In contrast, in the right-hand panel of Figure 3.14, the
predictors limit and rating are very highly correlated with each other, and we
say that they are collinear. The presence of collinearity can pose problems in
the regression context, since it can be difficult to separate out the
individual effects of collinear variables on the response. In other words, since
limit and rating tend to increase or decrease together, it can be difficult to
determine how each one separately is associated with the response, balance.
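
A small numerical experiment shows why the individual effects are hard to
separate. In the Python sketch below, which uses made-up data rather than the
Credit data, the second predictor is almost exactly twice the first. Very
different pairs of coefficients then give nearly the same RSS, while changing
one coefficient without compensating in the other makes the fit much worse.
The names `rss` and `correlation` are hypothetical helpers.

```python
import math


def rss(beta, x1, x2, y):
    """Residual sum of squares for the model b0 + b1 * x1 + b2 * x2."""
    b0, b1, b2 = beta
    return sum((yi - b0 - b1 * a - b2 * b) ** 2 for a, b, yi in zip(x1, x2, y))


def correlation(u, v):
    """Sample correlation between two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


x1 = [float(i) for i in range(1, 11)]
x2 = [2 * a + 0.05 * (-1) ** i for i, a in enumerate(x1)]  # almost exactly 2 * x1
y = [3 + a + 0.5 * b for a, b in zip(x1, x2)]              # noise-free toy response

print(round(correlation(x1, x2), 5))  # very close to 1

# Three quite different coefficient pairs that fit almost equally well.
along_valley = [rss(b, x1, x2, y) for b in [(3, 1, 0.5), (3, 2, 0.0), (3, 0, 1.0)]]
# Changing the x1 coefficient alone, with no offsetting change for x2.
across_valley = rss((3, 2, 0.5), x1, x2, y)

print(along_valley)
print(across_valley)
```

Because the data cannot distinguish between the coefficient pairs in
`along_valley`, least squares has little basis for choosing among them.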

Figure 3.15 illustrates some of the difficulties that can result from
collinearity. The left-hand panel of Figure 3.15 is a contour plot of the RSS
(3.22) associated with different possible coefficient estimates for the
regression of balance on limit and age. Each ellipse represents a set of
coefficients that correspond to the same RSS, with ellipses nearest to the
center taking on the lowest values of RSS. The black dots and associated dashed
lines represent the coefficient estimates that result in the smallest possible
RSS; in other words, these are the least squares estimates. The axes for
limit and age have been scaled so that the plot includes possible coefficient
estimates that are up to four standard errors on either side of the
least squares estimates. Thus the plot includes all plausible values for the
coefficients. For example, we see that the true limit coefficient is almost
certainly somewhere between 0.15 and 0.20.

FIGURE 3.15. Contour plots for the RSS values as a function of the parameters
\(\beta\) for various regressions involving the Credit data set. In each plot,
the black dots represent the coefficient values corresponding to the minimum
RSS. Left: A contour plot of RSS for the regression of balance onto age and
limit. The minimum value is well defined. Right: A contour plot of RSS for the
regression of balance onto rating and limit. Because of the collinearity, there
are many pairs \((\beta_{\text{Limit}}, \beta_{\text{Rating}})\) with a similar
value for RSS.

In contrast, the right-hand panel of Figure 3.15 displays contour plots 
of the RSS associated with possible coefficient estimates for the regression 
of balance onto limit and rating, which we know to be highly collinear. 
Now the contours run along a narrow valley; there is a broad range of 
values for the coefficient estimates that result in equal values for RSS. 
Hence a small change in the data could cause the pair of coefficient values 
that yield the smallest RSS—that is, the least squares estimates—to move 
anywhere along this valley. This results in a great deal of uncertainty in the 
coefficient estimates. Notice that the scale for the limit coefficient now runs 
from roughly -0.2 to 0.2; this is an eight-fold increase over the plausible 
range of the limit coefficient in the regression with age. Interestingly, even 
though the limit and rating coefficients now have much more individual 
uncertainty, they will almost certainly lie somewhere in this contour valley. 
For example, we would not expect the true value of the limit and rating 
coefficients to be -0.1 and 1 respectively, even though such a value is 
plausible for each coefficient individually. 

                      Coefficient   Std. error   t-statistic    p-value
Model 1   Intercept      -173.411       43.828        -3.957   < 0.0001
          age              -2.292        0.672        -3.407     0.0007
          limit             0.173        0.005        34.496   < 0.0001
Model 2   Intercept      -377.537       45.254        -8.343   < 0.0001
          rating            2.202        0.952         2.312     0.0213
          limit             0.025        0.064         0.384     0.7012

TABLE 3.11. The results for two multiple regression models involving the 
Credit data set are shown. Model 1 is a regression of balance on age and limit, 
and Model 2 a regression of balance on rating and limit. The standard error 
of $\hat{\beta}_{\text{limit}}$ increases 12-fold in the second regression, due to collinearity. 


Since collinearity reduces the accuracy of the estimates of the regression 
coefficients, it causes the standard error for $\hat{\beta}_j$ to grow. Recall that the 
t-statistic for each predictor is calculated by dividing $\hat{\beta}_j$ by its standard 
error. Consequently, collinearity results in a decline in the t-statistic. As a 
result, in the presence of collinearity, we may fail to reject $H_0 : \beta_j = 0$. This 
means that the power of the hypothesis test (the probability of correctly 
detecting a non-zero coefficient) is reduced by collinearity. 

Table 3.11 compares the coefficient estimates obtained from two separate 
multiple regression models. The first is a regression of balance on age and 
limit, and the second is a regression of balance on rating and limit. In the 
first regression, both age and limit are highly significant with very small 
p-values. In the second, the collinearity between limit and rating has caused 
the standard error for the limit coefficient estimate to increase by a factor 
of 12 and the p-value to increase to 0.701. In other words, the importance 
of the limit variable has been masked due to the presence of collinearity. 
To avoid such a situation, it is desirable to identify and address potential 
collinearity problems while fitting the model. 
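
The loss of precision described above can be reproduced on simulated data. The sketch below is not the Credit analysis: the variables x1, x2, and x3, the sample size, and the noise levels are all assumptions chosen for illustration. It fits the same response twice, once pairing x1 with an unrelated predictor and once pairing it with a near copy of itself, and compares the standard error of the x1 coefficient.

```r
# Simulated illustration (not the Credit data): a collinear companion
# predictor inflates the standard error of a coefficient estimate.
set.seed(1)
n  <- 400
x1 <- rnorm(n)                       # predictor of interest
x2 <- rnorm(n)                       # unrelated second predictor
x3 <- x1 + rnorm(n, sd = 0.05)       # nearly a copy of x1 (collinear)
y  <- 1 + 2 * x1 + rnorm(n)          # the response depends on x1 only

fit_indep <- lm(y ~ x1 + x2)         # x1 paired with an unrelated predictor
fit_coll  <- lm(y ~ x1 + x3)         # x1 paired with a collinear predictor

se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_coll  <- summary(fit_coll)$coefficients["x1", "Std. Error"]

# Standard errors of the x1 coefficient and their ratio
c(independent = se_indep, collinear = se_coll, ratio = se_coll / se_indep)
```

In runs of this kind the standard error in the collinear fit is typically many times larger, even though the response was generated in exactly the same way in both cases.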

A simple way to detect collinearity is to look at the correlation matrix 
of the predictors. An element of this matrix that is large in absolute value 
indicates a pair of highly correlated variables, and therefore a collinearity 
problem in the data. Unfortunately, not all collinearity problems can be 
detected by inspection of the correlation matrix: it is possible for 
collinearity to exist between three or more variables even if no pair of variables 
has a particularly high correlation. We call this situation multicollinearity. 
Instead of inspecting the correlation matrix, a better way to assess 
multicollinearity is to compute the variance inflation factor (VIF). The VIF is 
the ratio of the variance of $\hat{\beta}_j$ when fitting the full model divided by the 
variance of $\hat{\beta}_j$ if fit on its own. The smallest possible value for VIF is 1, 
which indicates the complete absence of collinearity. Typically in practice 
there is a small amount of collinearity among the predictors. As a rule of 
thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of 
collinearity. The VIF for each variable can be computed using the formula 

$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}},$$

where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other 
predictors. If $R^2_{X_j \mid X_{-j}}$ is close to one, then collinearity is present, and so 
the VIF will be large. 

In the Credit data, a regression of balance on age, rating, and limit 
indicates that the predictors have VIF values of 1.01, 160.67, and 160.59. 
As we suspected, there is considerable collinearity in the data! 
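
The VIF formula can be applied directly with lm(). The following is a minimal sketch on simulated predictors, not the Credit data; the helper name vif_manual and the variables x1, x2, and x3 are hypothetical and exist only for this illustration. Each predictor is regressed onto all of the others, and the resulting $R^2$ is plugged into the formula.

```r
# Simulated predictors: x1 is unrelated to the rest, x3 is nearly
# a linear function of x2 (all values are illustrative assumptions).
set.seed(2)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x2 + rnorm(n, sd = 0.1)
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF of every column of a data frame of predictors.
vif_manual <- function(X) {
  sapply(names(X), function(j) {
    others <- setdiff(names(X), j)
    # R^2 from regressing predictor j onto all of the other predictors
    r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X)
round(vifs, 2)
```

Under these assumptions x1 should have a VIF close to the minimum of 1, while x2 and x3 should both have very large values, since each is almost perfectly predicted by the other.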

When faced with the problem of collinearity, there are two simple 
solutions. The first is to drop one of the problematic variables from the 
regression. This can usually be done without much compromise to the regression 
fit, since the presence of collinearity implies that the information that this 
variable provides about the response is redundant in the presence of the 
other variables. For instance, if we regress balance onto age and limit, 
without the rating predictor, then the resulting VIF values are close to 
the minimum possible value of 1, and the $R^2$ drops from 0.754 to 0.75. 
So dropping rating from the set of predictors has effectively solved the 
collinearity problem without compromising the fit. The second solution is 
to combine the collinear variables together into a single predictor. For 
instance, we might take the average of standardized versions of limit and 
rating in order to create a new variable that measures credit worthiness. 
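
Both remedies are easy to try in R. The sketch below uses simulated data with made-up coefficients rather than the Credit data set, so the variable names x1 and x2 and all numeric settings are assumptions. It compares the $R^2$ before and after dropping one of two collinear predictors, and then builds a single combined predictor by averaging their standardized versions with scale().

```r
# Simulated data (illustrative assumptions, not the Credit data).
set.seed(3)
n  <- 300
x1 <- rnorm(n, mean = 4700, sd = 2300)
x2 <- 0.067 * x1 + rnorm(n, sd = 10)       # nearly a rescaled copy of x1
y  <- 0.17 * x1 + rnorm(n, sd = 200)

# Solution 1: drop one of the collinear predictors.
fit_full    <- lm(y ~ x1 + x2)
fit_dropped <- lm(y ~ x1)
r2 <- c(full    = summary(fit_full)$r.squared,
        dropped = summary(fit_dropped)$r.squared)

# Solution 2: average the standardized versions into a single predictor.
combined     <- (scale(x1)[, 1] + scale(x2)[, 1]) / 2
fit_combined <- lm(y ~ combined)

r2
summary(fit_combined)$coefficients
```

With predictors this strongly related, dropping x2 should change the $R^2$ only marginally, and the combined predictor should remain clearly associated with the response.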


3.4 The Marketing Plan 

We now briefly return to the seven questions about the Advertising data 
that we set out to answer at the beginning of this chapter. 


1. Is there a relationship between sales and advertising budget? 

This question can be answered by fitting a multiple regression model 
of sales onto TV, radio, and newspaper, as in (3.20), and testing the 
hypothesis $H_0 : \beta_{\text{TV}} = \beta_{\text{radio}} = \beta_{\text{newspaper}} = 0$. In Section 3.2.2, 
we showed that the F-statistic can be used to determine whether or 
not we should reject this null hypothesis. In this case the p-value 
corresponding to the F-statistic in Table 3.6 is very low, indicating 
clear evidence of a relationship between advertising and sales. 

2. How strong is the relationship? 

We discussed two measures of model accuracy in Section 3.1.3. First, 
the RSE estimates the standard deviation of the response from the 
population regression line. For the Advertising data, the RSE is 1,681 
units while the mean value for the response is 14,022, indicating a 
percentage error of roughly 12%. Second, the $R^2$ statistic records 
the percentage of variability in the response that is explained by 
the predictors. The predictors explain almost 90% of the variance in 
sales. The RSE and $R^2$ statistics are displayed in Table 3.6. 

3. Which media contribute to sales? 

To answer this question, we can examine the p-values associated with 
each predictor’s t-statistic (Section 3.1.2). In the multiple linear 
regression displayed in Table 3.4, the p-values for TV and radio are low, 
but the p-value for newspaper is not. This suggests that only TV and 
radio are related to sales. In Chapter 6 we explore this question in 
greater detail. 

4. How large is the effect of each medium on sales? 

We saw in Section 3.1.2 that the standard error of $\hat{\beta}_j$ can be used 
to construct confidence intervals for $\beta_j$. For the Advertising data, 
the 95% confidence intervals are as follows: (0.043, 0.049) for TV, 
(0.172, 0.206) for radio, and (-0.013, 0.011) for newspaper. The 
confidence intervals for TV and radio are narrow and far from zero, 
providing evidence that these media are related to sales. But the interval 
for newspaper includes zero, indicating that the variable is not 
statistically significant given the values of TV and radio. 

We saw in Section 3.3.3 that collinearity can result in very large 
standard errors. Could collinearity be the reason that the confidence 
interval associated with newspaper is so wide? The VIF scores are 1.005, 
1.145, and 1.145 for TV, radio, and newspaper, suggesting no evidence 
of collinearity. 

In order to assess the association of each medium individually on 
sales, we can perform three separate simple linear regressions. 
Results are shown in Tables 3.1 and 3.3. There is evidence of an 
extremely strong association between TV and sales and between radio 
and sales. There is evidence of a mild association between newspaper 
and sales, when the values of TV and radio are ignored. 

5. How accurately can we predict future sales? 

The response can be predicted using (3.21). The accuracy associated 
with this estimate depends on whether we wish to predict an 
individual response, $Y = f(X) + \epsilon$, or the average response, f(X) 
(Section 3.2.2). If the former, we use a prediction interval, and if the 
latter, we use a confidence interval. Prediction intervals will always 
be wider than confidence intervals because they account for the 
uncertainty associated with $\epsilon$, the irreducible error. 

6. Is the relationship linear? 

In Section 3.3.3, we saw that residual plots can be used in order to 
identify non-linearity. If the relationships are linear, then the residual 
plots should display no pattern. In the case of the Advertising data, 
we observe a non-linear effect in Figure 3.5, though this effect could 
also be observed in a residual plot. In Section 3.3.2, we discussed the 
inclusion of transformations of the predictors in the linear regression 
model in order to accommodate non-linear relationships. 

7. Is there synergy among the advertising media? 

The standard linear regression model assumes an additive 
relationship between the predictors and the response. An additive model is 
easy to interpret because the effect of each predictor on the response is 
unrelated to the values of the other predictors. However, the additive 
assumption may be unrealistic for certain data sets. In Section 3.3.3, 
we showed how to include an interaction term in the regression model 
in order to accommodate non-additive relationships. A small p-value 
associated with the interaction term indicates the presence of such 
relationships. Figure 3.5 suggested that the Advertising data may 
not be additive. Including an interaction term in the model results in 
a substantial increase in $R^2$, from around 90% to almost 97%. 
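
The check described in question 7 can be sketched in R. The data below are simulated, not the Advertising data set: the variable names tv, radio, and sales mirror the text, but the coefficients, ranges, and noise level are assumptions made only for illustration. The formula tv * radio expands to both main effects plus their interaction.

```r
# Simulated data with a built-in synergy effect (illustrative assumptions).
set.seed(4)
n     <- 200
tv    <- runif(n, 0, 300)
radio <- runif(n, 0, 50)
sales <- 7 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rnorm(n)

fit_add <- lm(sales ~ tv + radio)    # additive model
fit_int <- lm(sales ~ tv * radio)    # main effects plus tv:radio interaction

# Compare the fit of the two models
r2 <- c(additive    = summary(fit_add)$r.squared,
        interaction = summary(fit_int)$r.squared)
r2

# p-value for the interaction term
p_int <- summary(fit_int)$coefficients["tv:radio", "Pr(>|t|)"]
p_int
```

A small p-value for tv:radio together with a higher $R^2$ in the second model is the pattern the text describes as evidence of a non-additive relationship.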

3.5 Comparison of Linear Regression with K-Nearest Neighbors 

As discussed in Chapter 2, linear regression is an example of a parametric 
approach because it assumes a linear functional form for f(X). Parametric 
methods have several advantages. They are often easy to fit, because one 
need estimate only a small number of coefficients. In the case of linear 
regression, the coefficients have simple interpretations, and tests of statistical 
significance can be easily performed. But parametric methods do have a 
disadvantage: by construction, they make strong assumptions about the 
form of f(X). If the specified functional form is far from the truth, and 
prediction accuracy is our goal, then the parametric method will perform 
poorly. For instance, if we assume a linear relationship between X and Y 
but the true relationship is far from linear, then the resulting model will 
provide a poor fit to the data, and any conclusions drawn from it will be 
suspect. 

In contrast, non-parametric methods do not explicitly assume a 
parametric form for f(X), and thereby provide an alternative and more 
flexible approach for performing regression. We discuss various non-parametric 
methods in this book. Here we consider one of the simplest and best-known 
non-parametric methods, K-nearest neighbors regression (KNN regression). 





FIGURE 3.16. Plots of $\hat{f}(X)$ using KNN regression on a two-dimensional data 
set with 64 observations (orange dots). Left: K = 1 results in a rough step 
function fit. Right: K = 9 produces a much smoother fit. 


The KNN regression method is closely related to the KNN classifier 
discussed in Chapter 2. Given a value for K and a prediction point $x_0$, KNN 
regression first identifies the K training observations that are closest to 
$x_0$, represented by $\mathcal{N}_0$. It then estimates $f(x_0)$ using the average of all the 
training responses in $\mathcal{N}_0$. In other words, 

$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i.$$


Figure 3.16 illustrates two KNN fits on a data set with p = 2 predictors. 
The fit with K = 1 is shown in the left-hand panel, while the right-hand 
panel corresponds to K = 9. We see that when K = 1, the KNN fit perfectly 
interpolates the training observations, and consequently takes the form of 
a step function. When K = 9, the KNN fit still is a step function, but 
averaging over nine observations results in much smaller regions of constant 
prediction, and consequently a smoother fit. In general, the optimal value 
for K will depend on the bias-variance tradeoff, which we introduced in 
Chapter 2. A small value for K provides the most flexible fit, which will 
have low bias but high variance. This variance is due to the fact that the 
prediction in a given region is entirely dependent on just one observation. 
In contrast, larger values of K provide a smoother and less variable fit; the 
prediction in a region is an average of several points, and so changing one 
observation has a smaller effect. However, the smoothing may cause bias by 
masking some of the structure in f(X). In Chapter 5, we introduce several 
approaches for estimating test error rates. These methods can be used to 
identify the optimal value of K in KNN regression. 
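
The KNN regression estimate is short enough to write out in base R. The function below is a minimal sketch, not a library routine: the name knn_reg is hypothetical, and Euclidean distance is assumed as the measure of closeness. For each prediction point it finds the K nearest training observations and averages their responses.

```r
# Hypothetical helper: KNN regression with Euclidean distance (an assumption).
# x_train: n x p matrix (or vector for p = 1), y_train: length-n responses,
# x_new: m x p matrix (or vector) of prediction points, k: number of neighbors.
knn_reg <- function(x_train, y_train, x_new, k) {
  x_train <- as.matrix(x_train)
  x_new   <- as.matrix(x_new)
  apply(x_new, 1, function(x0) {
    # distance from x0 to every training observation
    d <- sqrt(colSums((t(x_train) - x0)^2))
    # average the responses of the k closest training observations
    mean(y_train[order(d)[seq_len(k)]])
  })
}

# Small simulated example with one predictor
set.seed(5)
x <- runif(64, -1, 1)
y <- 2 + 3 * x + rnorm(64, sd = 0.5)

fit1 <- knn_reg(x, y, x, k = 1)   # K = 1 interpolates the training data
fit9 <- knn_reg(x, y, x, k = 9)   # K = 9 averages over nine neighbors
```

With K = 1 each training point is its own nearest neighbor, so the fitted values at the training points reproduce the training responses, which is the interpolation behavior described above.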

In what setting will a parametric approach such as least squares linear 
regression outperform a non-parametric approach such as KNN regression? 
The answer is simple: the parametric approach will outperform the 
non-parametric approach if the parametric form that has been selected is close 
to the true form of f. Figure 3.17 provides an example with data generated 
from a one-dimensional linear regression model. The black solid lines 
represent f(X), while the blue curves correspond to the KNN fits using K = 1 
and K = 9. In this case, the K = 1 predictions are far too variable, while 
the smoother K = 9 fit is much closer to f(X). However, since the true 
relationship is linear, it is hard for a non-parametric approach to compete 
with linear regression: a non-parametric approach incurs a cost in variance 
that is not offset by a reduction in bias. The blue dashed line in the 
left-hand panel of Figure 3.18 represents the linear regression fit to the same 
data. It is almost perfect. The right-hand panel of Figure 3.18 reveals that 
linear regression outperforms KNN for this data. The green solid line, 
plotted as a function of 1/K, represents the test set mean squared error (MSE) 
for KNN. The KNN errors are well above the black dashed line, which is 
the test MSE for linear regression. When the value of K is large, then KNN 
performs only a little worse than least squares regression in terms of MSE. 
It performs far worse when K is small. 
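
A comparison of this kind can be sketched on simulated data. Everything here is an assumption made for illustration: the linear truth f, the sample sizes, the noise level, the grid of K values, and the hypothetical helper knn_1d, which implements one-dimensional KNN regression. It is not the simulation behind Figure 3.18.

```r
# Simulated comparison of test MSE when the true relationship is linear.
set.seed(6)
f    <- function(x) 2 + 3 * x                      # assumed linear truth
n    <- 100
x_tr <- runif(n, -1, 1);    y_tr <- f(x_tr) + rnorm(n)
x_te <- runif(1000, -1, 1); y_te <- f(x_te) + rnorm(1000)

# Hypothetical helper: one-dimensional KNN regression
knn_1d <- function(x_tr, y_tr, x0, k) {
  sapply(x0, function(z) mean(y_tr[order(abs(x_tr - z))[seq_len(k)]]))
}

mse <- function(pred) mean((y_te - pred)^2)        # test set MSE

fit_lm <- lm(y ~ x, data = data.frame(x = x_tr, y = y_tr))
mse_lm <- mse(predict(fit_lm, data.frame(x = x_te)))

ks      <- c(1, 3, 9, 25)
mse_knn <- sapply(ks, function(k) mse(knn_1d(x_tr, y_tr, x_te, k)))

round(c(lm = mse_lm, setNames(mse_knn, paste0("K=", ks))), 3)
```

Under these assumptions the K = 1 fit should have a clearly larger test MSE than linear regression, and larger values of K should narrow the gap.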

In practice, the true relationship between X and Y is rarely exactly 
linear. Figure 3.19 examines the relative performances of least squares 
regression and KNN under increasing levels of non-linearity in the relationship 
between X and Y. In the top row, the true relationship is nearly linear. 
In this case we see that the test MSE for linear regression is still superior 
to that of KNN for low values of K. However, for K ≥ 4, KNN 
outperforms linear regression. The second row illustrates a more substantial 
deviation from linearity. In this situation, KNN substantially outperforms 
linear regression for all values of K. Note that as the extent of non-linearity 
increases, there is little change in the test set MSE for the non-parametric 
KNN method, but there is a large increase in the test set MSE of linear 
regression. 

Figures 3.18 and 3.19 display situations in which KNN performs slightly 
worse than linear regression when the relationship is linear, but much better 
than linear regression for non-linear situations. In a real life situation in 
which the true relationship is unknown, one might draw the conclusion that 
KNN should be favored over linear regression because it will at worst be 
slightly inferior to linear regression if the true relationship is linear, and 
may give substantially better results if the true relationship is non-linear. 
But in reality, even when the true relationship is highly non-linear, KNN 
may still provide inferior results to linear regression. In particular, both 
Figures 3.18 and 3.19 illustrate settings with p = 1 predictor. But in higher 
dimensions, KNN often performs worse than linear regression. 

FIGURE 3.17. Plots of $\hat{f}(X)$ using KNN regression on a one-dimensional data 
set with 100 observations. The true relationship is given by the black solid line. 
Left: The blue curve corresponds to K = 1 and interpolates (i.e. passes directly 
through) the training data. Right: The blue curve corresponds to K = 9, and 
represents a smoother fit. 

FIGURE 3.18. The same data set shown in Figure 3.17 is investigated further. 
Left: The blue dashed line is the least squares fit to the data. Since f(X) is in 
fact linear (displayed as the black line), the least squares regression line provides 
a very good estimate of f(X). Right: The dashed horizontal line represents the 
least squares test set MSE, while the green solid line corresponds to the MSE 
for KNN as a function of 1/K (on the log scale). Linear regression achieves a 
lower test MSE than does KNN regression, since f(X) is in fact linear. For KNN 
regression, the best results occur with a very large value of K, corresponding to a 
small value of 1/K. 

FIGURE 3.19. Top Left: In a setting with a slightly non-linear relationship 
between X and Y (solid black line), the KNN fits with K = 1 (blue) and K = 9 
(red) are displayed. Top Right: For the slightly non-linear data, the test set MSE 
for least squares regression (horizontal black) and KNN with various values of 
1/K (green) are displayed. Bottom Left and Bottom Right: As in the top panel, 
but with a strongly non-linear relationship between X and Y. 

Figure 3.20 considers the same strongly non-linear situation as in the 
second row of Figure 3.19, except that we have added additional noise 
predictors that are not associated with the response. When p = 1 or p = 2, 
KNN outperforms linear regression. But for p = 3 the results are mixed, 
and for p ≥ 4 linear regression is superior to KNN. In fact, the increase in 
dimension has only caused a small deterioration in the linear regression test 
set MSE, but it has caused more than a ten-fold increase in the MSE for 
KNN. This decrease in performance as the dimension increases is a common 
problem for KNN, and results from the fact that in higher dimensions 
there is effectively a reduction in sample size. In this data set there are 
100 training observations; when p = 1, this provides enough information to 
accurately estimate f(X). However, spreading 100 observations over p = 20 
dimensions results in a phenomenon in which a given observation has no 
nearby neighbors; this is the so-called curse of dimensionality. That is, 
the K observations that are nearest to a given test observation $x_0$ may be 
very far away from $x_0$ in p-dimensional space when p is large, leading to a 
very poor prediction of $f(x_0)$ and hence a poor KNN fit. As a general rule, 
parametric methods will tend to outperform non-parametric approaches 
when there is a small number of observations per predictor. 

[Figure 3.20 panels: p = 1, p = 2, p = 3, p = 4, p = 10, p = 20; horizontal axis: 1/K] 

FIGURE 3.20. Test MSE for linear regression (black dashed lines) and KNN 
(green curves) as the number of variables p increases. The true function is 
non-linear in the first variable, as in the lower panel in Figure 3.19, and does not 
depend on the additional variables. The performance of linear regression 
deteriorates slowly in the presence of these additional noise variables, whereas KNN’s 
performance degrades much more quickly as p increases. 
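
The claim that neighbors become distant in high dimensions can be checked numerically. The sketch below is an illustration under stated assumptions, not the simulation behind Figure 3.20: predictors are drawn uniformly on the unit cube, there are 100 training observations as in the text, and the hypothetical helper nn_dist returns the Euclidean distance from a random test point to its nearest training observation.

```r
# Average nearest-neighbor distance for n = 100 points as p grows.
set.seed(7)
n <- 100

# Hypothetical helper: distance from one random test point to its
# nearest neighbor among n uniform training points in p dimensions.
nn_dist <- function(p) {
  X  <- matrix(runif(n * p), nrow = n)    # n training points in the unit cube
  x0 <- runif(p)                          # one test point
  min(sqrt(rowSums(sweep(X, 2, x0)^2)))   # sweep() subtracts x0 from each row
}

ps     <- c(1, 2, 3, 4, 10, 20)
avg_nn <- sapply(ps, function(p) mean(replicate(200, nn_dist(p))))

round(setNames(avg_nn, paste0("p=", ps)), 3)
```

With the sample size held fixed, the average distance to the nearest neighbor should grow steadily with p, so the "nearest" observations used by KNN are no longer local.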

Even in problems in which the dimension is small, we might prefer linear 
regression to KNN from an interpretability standpoint. If the test MSE 
of KNN is only slightly lower than that of linear regression, we might be 
willing to forgo a little bit of prediction accuracy for the sake of a simple 
model that can be described in terms of just a few coefficients, and for 
which p-values are available. 


3.6 Lab: Linear Regression 

3.6.1 Libraries 

The library() function is used to load libraries, or groups of functions and 
data sets that are not included in the base R distribution. Basic functions 
that perform least squares linear regression and other simple analyses come 
standard with the base distribution, but more exotic functions require 
additional libraries. Here we load the MASS package, which is a very large 
collection of data sets and functions. We also load the ISLR package, which 
includes the data sets associated with this book. 

> library(MASS) 

> library(ISLR) 

If you receive an error message when loading any of these libraries, it 
likely indicates that the corresponding library has not yet been installed 
on your system. Some libraries, such as MASS, come with R and do not need to 
be separately installed on your computer. However, other packages, such as 
ISLR, must be downloaded the first time they are used. This can be done 
directly from within R. For example, on a Windows system, select the Install 
package option under the Packages tab. After you select any mirror site, a 
list of available packages will appear. Simply select the package you wish to 
install and R will automatically download the package. Alternatively, this 
can be done at the R command line via install.packages("ISLR"). This 
installation only needs to be done the first time you use a package. However, 
the library() function must be called each time you wish to use a given 
package. 
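
The install-once, load-every-time pattern can be wrapped in a small helper. This is only a sketch: the function name load_or_install is hypothetical, and it assumes that a CRAN mirror and an internet connection are available whenever a package actually needs to be installed.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)     # runs only when the package is not yet installed
  }
  library(pkg, character.only = TRUE)
}

load_or_install("MASS")       # MASS ships with R, so nothing is downloaded here
```

Calling load_or_install("ISLR") would follow the same path, downloading the package on first use and simply loading it afterwards.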


3.6.2 Simple Linear Regression 

The MASS library contains the Boston data set, which records medv (median 
house value) for 506 neighborhoods around Boston. We will seek to predict 
medv using 13 predictors such as rm (average number of rooms per house), 
age (average age of houses), and lstat (percent of households with low 
socioeconomic status). 

> fix(Boston) 

> names(Boston) 

[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" 

[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv" 

To find out more about the data set, we can type ?Boston. 

We will start by using the lm() function to fit a simple linear regression 
model, with medv as the response and lstat as the predictor. The basic 
syntax is lm(y~x,data), where y is the response, x is the predictor, and 
data is the data set in which these two variables are kept. 

> lm.fit=lm(medv~lstat) 

Error in eval(expr, envir, enclos) : Object "medv" not found 

The command causes an error because R does not know where to find 
the variables medv and lstat. The next line tells R that the variables are 
in Boston. If we attach Boston, the first line works fine because R now 
recognizes the variables. 

> lm.fit=lm(medv~lstat,data=Boston) 

> attach(Boston) 

> lm.fit=lm(medv~lstat) 

If we type lm.fit, some basic information about the model is output. 
For more detailed information, we use summary(lm.fit). This gives us 
p-values and standard errors for the coefficients, as well as the $R^2$ statistic 
and F-statistic for the model. 

> lm.fit 

Call: 
lm(formula = medv ~ lstat) 

Coefficients: 
(Intercept)        lstat 
      34.55        -0.95 

> summary(lm.fit) 

Call: 
lm(formula = medv ~ lstat) 

Residuals: 
   Min     1Q Median     3Q    Max 
-15.17  -3.99  -1.32   2.03  24.50 

Coefficients: 
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)  34.5538     0.5626    61.4   <2e-16 *** 
lstat        -0.9500     0.0387   -24.5   <2e-16 *** 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 6.22 on 504 degrees of freedom 
Multiple R-squared: 0.544, Adjusted R-squared: 0.543 
F-statistic: 602 on 1 and 504 DF, p-value: <2e-16 

We can use the names() function in order to find out what other pieces 
of information are stored in lm.fit. Although we can extract these 
quantities by name (e.g. lm.fit$coefficients), it is safer to use the extractor 
functions like coef() to access them. 

> names(lm.fit) 
 [1] "coefficients"  "residuals"     "effects" 
 [4] "rank"          "fitted.values" "assign" 
 [7] "qr"            "df.residual"   "xlevels" 
[10] "call"          "terms"         "model" 
> coef(lm.fit) 
(Intercept)       lstat 
      34.55       -0.95 

In order to obtain a confidence interval for the coefficient estimates, we can 
use the confint() command. 

> confint(lm.fit) 
            2.5 % 97.5 % 
(Intercept) 33.45 35.659 
lstat       -1.03 -0.874 

The predict() function can be used to produce confidence intervals and 
prediction intervals for the prediction of medv for a given value of lstat. 

> predict(lm.fit,data.frame(lstat=(c(5,10,15))), 
    interval="confidence") 
    fit   lwr   upr 
1 29.80 29.01 30.60 
2 25.05 24.47 25.63 
3 20.30 19.73 20.87 
> predict(lm.fit,data.frame(lstat=(c(5,10,15))), 
    interval="prediction") 
    fit    lwr   upr 
1 29.80 17.566 42.04 
2 25.05 12.828 37.28 
3 20.30  8.078 32.53 

For instance, the 95% confidence interval associated with an lstat value of 
10 is (24.47,25.63), and the 95% prediction interval is (12.828,37.28). As 
expected, the confidence and prediction intervals are centered around the 
same point (a predicted value of 25.05 for medv when lstat equals 10), but 
the latter are substantially wider. 
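
To see why the prediction interval is wider, both intervals can be computed by hand. The sketch below uses the standard simple linear regression formulas, which are not spelled out in this lab; it assumes the Boston data from MASS and the same model as above. The only difference between the two standard errors is the extra residual variance term in se_pred.

```r
library(MASS)                                  # provides the Boston data set
fit <- lm(medv ~ lstat, data = Boston)

x    <- Boston$lstat
n    <- length(x)
x0   <- 10                                     # value of lstat to predict at
rse  <- summary(fit)$sigma                     # residual standard error
tq   <- qt(0.975, df = n - 2)                  # t quantile for a 95% interval
yhat <- unname(coef(fit)[1] + coef(fit)[2] * x0)

# Standard error of the estimated mean response at x0
se_mean <- rse * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
# Standard error for an individual response adds the irreducible error
se_pred <- sqrt(rse^2 + se_mean^2)

conf_int <- yhat + c(-1, 1) * tq * se_mean
pred_int <- yhat + c(-1, 1) * tq * se_pred
rbind(confidence = conf_int, prediction = pred_int)
```

These hand-computed limits should agree with the lwr and upr columns that predict() reports for lstat equal to 10.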

We will now plot medv and lstat along with the least squares regression 
line using the plot() and abline() functions. 

> plot(lstat,medv) 

> abline(lm.fit) 

There is some evidence for non-linearity in the relationship between lstat 
and medv. We will explore this issue later in this lab. 

The abline() function can be used to draw any line, not just the least 
squares regression line. To draw a line with intercept a and slope b, we 
type abline(a,b). Below we experiment with some additional settings for 
plotting lines and points. The lwd=3 command causes the width of the 
regression line to be increased by a factor of 3; this works for the plot() 
and lines() functions also. We can also use the pch option to create different 
plotting symbols. 

> abline(lm.fit,lwd=3) 
> abline(lm.fit,lwd=3,col="red") 
> plot(lstat,medv,col="red") 
> plot(lstat,medv,pch=20) 
> plot(lstat,medv,pch="+") 
> plot(1:20,1:20,pch=1:20) 

Next we examine some diagnostic plots, several of which were discussed 
in Section 3.3.3. Four diagnostic plots are automatically produced by 
applying the plot() function directly to the output from lm(). In general, this 
command will produce one plot at a time, and hitting Enter will generate 
the next plot. However, it is often convenient to view all four plots together. 
We can achieve this by using the par() function, which tells R to split the 
display screen into separate panels so that multiple plots can be viewed 
simultaneously. For example, par(mfrow=c(2,2)) divides the plotting region 
into a 2 × 2 grid of panels. 

> par(mfrow=c(2,2)) 

> plot(lm.fit) 

Alternatively, we can compute the residuals from a linear regression fit 
using the residuals() function. The function rstudent() will return the 
studentized residuals, and we can use this function to plot the residuals 
against the fitted values. 

> plot(predict(lm.fit), residuals(lm.fit)) 
> plot(predict(lm.fit), rstudent(lm.fit)) 

On the basis of the residual plots, there is some evidence of non-linearity. 
Leverage statistics can be computed for any number of predictors using the 
hatvalues() function. 

> plot(hatvalues(lm.fit)) 

> which.max(hatvalues(lm.fit)) 

375 

The which.max() function identifies the index of the largest element of a 
vector. In this case, it tells us which observation has the largest leverage 
statistic. 
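
For a simple linear regression the leverage statistic has a closed form, so the output of hatvalues() can be checked by hand and compared with the average leverage (p + 1)/n. The sketch below assumes the Boston data from MASS and the one-predictor model used above; the cutoff of three times the average leverage is an arbitrary choice for illustration, not a rule from the text.

```r
library(MASS)                                  # provides the Boston data set
fit <- lm(medv ~ lstat, data = Boston)

x <- Boston$lstat
n <- length(x)

# Leverage for simple linear regression, computed from its closed form
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Should match what hatvalues() reports
all.equal(unname(hatvalues(fit)), h_manual)

# Average leverage is (p + 1) / n, with p = 1 predictor here
avg_lev <- (1 + 1) / n

# Observations whose leverage greatly exceeds the average
# (the factor of 3 is an illustrative assumption)
high_lev <- which(h_manual > 3 * avg_lev)
head(high_lev)
```

Because the leverage depends only on the predictor values, observations with lstat far from its mean are the ones flagged.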


3.6.3 Multiple Linear Regression 

In order to fit a multiple linear regression model using least squares, we 
again use the lm() function. The syntax lm(y~x1+x2+x3) is used to fit a 
model with three predictors, x1, x2, and x3. The summary() function now 
outputs the regression coefficients for all the predictors. 

> lm.fit=lm(medv~lstat+age,data=Boston) 
> summary(lm.fit) 

Call: 
lm(formula = medv ~ lstat + age, data = Boston) 

Residuals: 
   Min     1Q Median     3Q    Max 
-15.98  -3.98  -1.28   1.97  23.16 

Coefficients: 
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)  33.2228     0.7308   45.46   <2e-16 *** 
lstat        -1.0321     0.0482  -21.42   <2e-16 *** 
age           0.0345     0.0122    2.83   0.0049 ** 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 6.17 on 503 degrees of freedom 
Multiple R-squared: 0.551, Adjusted R-squared: 0.549 
F-statistic: 309 on 2 and 503 DF, p-value: <2e-16 

The Boston data set contains 13 predictors, and so it would be cumbersome 
to have to type all of these in order to perform a regression using all of the 
predictors. Instead, we can use the following short-hand: 

> lm.fit=lm(medv~.,data=Boston) 
> summary(lm.fit) 

Call: 
lm(formula = medv ~ ., data = Boston) 

Residuals:
    Min      1Q  Median      3Q     Max
-15.594  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas         2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.761e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-Squared: 0.7406,    Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
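
The three summary lines that close this output can be recomputed by hand, which may make them easier to read. The sketch below is ours rather than part of the lab. It assumes the MASS package is installed and uses the usual definitions RSE = sqrt(RSS/(n-p-1)), R^2 = 1 - RSS/TSS, and F = ((TSS-RSS)/p)/(RSS/(n-p-1)); the variable names are illustrative.

```r
library(MASS)

fit = lm(medv ~ ., data = Boston)
rss = sum(residuals(fit)^2)                     # residual sum of squares
tss = sum((Boston$medv - mean(Boston$medv))^2)  # total sum of squares
n = nrow(Boston)
p = length(coef(fit)) - 1                       # number of predictors

rse = sqrt(rss / (n - p - 1))                   # residual standard error
r2 = 1 - rss / tss                              # multiple R-squared
f = ((tss - rss) / p) / (rss / (n - p - 1))     # overall F-statistic

c(RSE = rse, R2 = r2, F = f)
```

The same three numbers are stored in the summary object as sigma, r.squared and fstatistic, so the hand calculation is only a check on what those fields mean.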

We can access the individual components of a summary object by name
(type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq
gives us the $R^2$, and summary(lm.fit)$sigma gives us the RSE. The vif()
function, part of the car package, can be used to compute variance inflation
factors. Most VIFs are low to moderate for these data. The car package is
not part of the base R installation, so it must be downloaded the first time
you use it via the install.packages() function in R.

> library(car)
> vif(lm.fit)
   crim      zn   indus    chas     nox      rm     age
   1.79    2.30    3.99    1.07    4.39    1.93    3.10
    dis     rad     tax ptratio   black   lstat
   3.96    7.48    9.01    1.80    1.35    2.94
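
If the car package is not available, the variance inflation factors can be approximated from their definition: the VIF for a predictor is 1/(1 - R^2), where R^2 comes from regressing that predictor onto all of the others. The following is a minimal sketch under that definition, assuming MASS is installed; predictors and vif.manual are names we introduce here, and the code is not the implementation used inside car.

```r
library(MASS)

predictors = setdiff(names(Boston), "medv")

# For each predictor, regress it onto the remaining predictors and
# convert the resulting R-squared into a variance inflation factor
vif.manual = sapply(predictors, function(v) {
  others = setdiff(predictors, v)
  r2 = summary(lm(reformulate(others, response = v), data = Boston))$r.squared
  1 / (1 - r2)
})

round(vif.manual, 2)
```

Because each factor needs its own regression, this is slower than a dedicated routine, but it makes clear that a large VIF simply means a predictor is well explained by the others.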



What if we would like to perform a regression using all of the variables but 
one? For example, in the above regression output, age has a high p-value. 
So we may wish to run a regression excluding this predictor. The following 
syntax results in a regression using all predictors except age. 

> lm.fit1=lm(medv~.-age,data=Boston)
> summary(lm.fit1)

Alternatively, the update() function can be used.

> lm.fit1=update(lm.fit, ~.-age)
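
A quick way to convince oneself that the two routes are equivalent is to fit both and compare the coefficients. The sketch below assumes MASS is installed; fit.a and fit.b are illustrative names, not objects from the lab.

```r
library(MASS)

lm.fit = lm(medv ~ ., data = Boston)
fit.a = lm(medv ~ . - age, data = Boston)   # drop age in the formula
fit.b = update(lm.fit, ~ . - age)           # drop age from an existing fit

all.equal(coef(fit.a), coef(fit.b))         # do the two fits agree?
"age" %in% names(coef(fit.a))               # is age still a predictor?
```

update() is convenient when a sequence of closely related models is being explored, since only the change to the formula has to be written out.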


3.6.4 Interaction Terms 

It is easy to include interaction terms in a linear model using the lm()
function. The syntax lstat:black tells R to include an interaction term between
lstat and black. The syntax lstat*age simultaneously includes lstat, age,
and the interaction term lstat×age as predictors; it is a shorthand for
lstat+age+lstat:age.

> summary(lm(medv~lstat*age,data=Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max
-15.81  -4.04  -1.33   2.08  27.55

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.088536   1.469835   24.55  < 2e-16 ***
lstat       -1.392117   0.167456   -8.31  8.8e-16 ***
age         -0.000721   0.019879   -0.04    0.971
lstat:age    0.004156   0.001852    2.24    0.025 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.15 on 502 degrees of freedom
Multiple R-squared: 0.556,    Adjusted R-squared: 0.553
F-statistic: 209 on 3 and 502 DF,  p-value: <2e-16
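
The shorthand described at the start of this section can be checked directly. The sketch below, which assumes MASS is installed and uses names of our own choosing, fits the model both ways and also confirms that the interaction column of the design matrix is just the product of the two predictors.

```r
library(MASS)

fit.star = lm(medv ~ lstat * age, data = Boston)
fit.long = lm(medv ~ lstat + age + lstat:age, data = Boston)
all.equal(coef(fit.star), coef(fit.long))

# The interaction column of the design matrix is the elementwise product
X = model.matrix(fit.star)
all.equal(unname(X[, "lstat:age"]), Boston$lstat * Boston$age)
```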


3.6.5 Non-linear Transformations of the Predictors 

The lm() function can also accommodate non-linear transformations of the
predictors. For instance, given a predictor X, we can create a predictor $X^2$
using I(X^2). The function I() is needed since the ^ has a special meaning
in a formula; wrapping as we do allows the standard usage in R, which is
to raise X to the power 2. We now perform a regression of medv onto lstat
and lstat^2.

> lm.fit2=lm(medv~lstat+I(lstat^2))
> summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2))

Residuals:
   Min     1Q Median     3Q    Max
-15.28  -3.83  -0.53   2.31  25.41

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.86201    0.87208    49.1   <2e-16 ***
lstat       -2.33282    0.12380   -18.8   <2e-16 ***
I(lstat^2)   0.04355    0.00375    11.6   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.52 on 503 degrees of freedom
Multiple R-squared: 0.641,    Adjusted R-squared: 0.639
F-statistic: 449 on 2 and 503 DF, p-value: <2e-16
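
To see why the I() wrapper matters, it can help to fit the model with and without it. In the sketch below (our own illustration, assuming MASS is installed), the unwrapped ^ is treated as formula syntax, so no quadratic term is added to the model.

```r
library(MASS)

fit.plain = lm(medv ~ lstat + lstat^2, data = Boston)      # ^ read as formula syntax
fit.wrapped = lm(medv ~ lstat + I(lstat^2), data = Boston) # ^ read as arithmetic

names(coef(fit.plain))     # intercept and lstat only
names(coef(fit.wrapped))   # also includes the I(lstat^2) term
```

R does not warn about the unwrapped version, so checking the coefficient names is a cheap safeguard when writing transformed predictors.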

The near-zero p-value associated with the quadratic term suggests that 
it leads to an improved model. We use the anova() function to further
quantify the extent to which the quadratic fit is superior to the linear fit. 

> lm.fit=lm(medv~lstat)
> anova(lm.fit,lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq   F Pr(>F)
1    504 19472
2    503 15347  1      4125 135 <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Here Model 1 represents the linear submodel containing only one predictor, 
lstat, while Model 2 corresponds to the larger quadratic model that has two 
predictors, lstat and lstat^2. The anova() function performs a hypothesis
test comparing the two models. The null hypothesis is that the two models
fit the data equally well, and the alternative hypothesis is that the full
model is superior. Here the F-statistic is 135 and the associated p-value is
virtually zero. This provides very clear evidence that the model containing
the predictors lstat and lstat^2 is far superior to the model that only
contains the predictor lstat. This is not surprising, since earlier we saw 
evidence for non-linearity in the relationship between medv and lstat. If we 
type 

> par(mfrow=c(2,2)) 

> plot(lm.fit2) 

then we see that when the lstat^2 term is included in the model, there is
little discernible pattern in the residuals. 
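
The F-statistic reported by anova() earlier in this section can be reproduced from the two residual sums of squares: with the table values, (19472 - 15347)/(15347/503) is roughly 135. The sketch below carries out the same calculation in R. It assumes MASS is installed and passes data=Boston explicitly rather than relying on attach(); the names fit1, fit2 and f.manual are ours.

```r
library(MASS)

fit1 = lm(medv ~ lstat, data = Boston)
fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)

rss1 = sum(residuals(fit1)^2)
rss2 = sum(residuals(fit2)^2)
df1 = df.residual(fit1)
df2 = df.residual(fit2)

# F = [(RSS1 - RSS2) / (df1 - df2)] / [RSS2 / df2]
f.manual = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
f.manual
anova(fit1, fit2)$F[2]
```

Since the two models differ by a single term, this F-statistic is also the square of the t-statistic for the quadratic term in summary(lm.fit2), which is why 11.6 squared is close to 135.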

In order to create a cubic fit, we can include a predictor of the form
I(X^3). However, this approach can start to get cumbersome for higher-order
polynomials. A better approach involves using the poly() function
to create the polynomial within lm(). For example, the following command
produces a fifth-order polynomial fit:
> lm.fit5=lm(medv~poly(lstat,5))
> summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
    Min      1Q  Median      3Q     Max
-13.543  -3.104  -0.705   2.084  27.115

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        22.533      0.232   97.20  < 2e-16 ***
poly(lstat, 5)1  -152.460      5.215  -29.24  < 2e-16 ***
poly(lstat, 5)2    64.227      5.215   12.32  < 2e-16 ***
poly(lstat, 5)3   -27.051      5.215   -5.19  3.1e-07 ***
poly(lstat, 5)4    25.452      5.215    4.88  1.4e-06 ***
poly(lstat, 5)5   -19.252      5.215   -3.69  0.00025 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.21 on 500 degrees of freedom
Multiple R-squared: 0.682,    Adjusted R-squared: 0.679
F-statistic: 214 on 5 and 500 DF, p-value: <2e-16


This suggests that including additional polynomial terms, up to fifth order, 
leads to an improvement in the model fit! However, further investigation of 
the data reveals that no polynomial terms beyond fifth order have significant
p-values in a regression fit.
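
One point worth knowing when reading poly() output: by default the function builds orthogonal polynomials, so its coefficients are not directly comparable with those from I(lstat^2), even though the fitted values coincide. The sketch below illustrates this for a quadratic fit. It assumes MASS is installed and uses the documented raw=TRUE argument of poly() to request raw powers; the object names are ours.

```r
library(MASS)

fit.orth = lm(medv ~ poly(lstat, 2), data = Boston)             # orthogonal basis
fit.raw = lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)  # lstat and lstat^2
fit.I = lm(medv ~ lstat + I(lstat^2), data = Boston)

all.equal(fitted(fit.orth), fitted(fit.I))              # same fitted values
all.equal(unname(coef(fit.raw)), unname(coef(fit.I)))   # same coefficients
```

The orthogonal basis is also the reason every polynomial term in the fifth-order output shares the same standard error.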

Of course, we are in no way restricted to using polynomial transformations
of the predictors. Here we try a log transformation.

> summary(lm(medv~log(rm),data=Boston)) 


3.6.6 Qualitative Predictors 

We will now examine the Carseats data, which is part of the ISLR library. 
We will attempt to predict Sales (child car seat sales) in 400 locations 
based on a number of predictors. 

> fix(Carseats)
> names(Carseats)
 [1] "Sales"       "CompPrice"   "Income"      "Advertising"
 [5] "Population"  "Price"       "ShelveLoc"   "Age"
 [9] "Education"   "Urban"       "US"



The Carseats data includes qualitative predictors such as ShelveLoc, an
indicator of the quality of the shelving location (that is, the space within
a store in which the car seat is displayed) at each location. The
predictor ShelveLoc takes on three possible values: Bad, Medium, and Good.
Given a qualitative variable such as ShelveLoc, R generates dummy variables
automatically. Below we fit a multiple regression model that includes some
interaction terms.

> lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats)
> summary(lm.fit)

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
   Min     1Q Median     3Q    Max
-2.921 -0.750  0.018  0.675  3.341

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         6.575565   1.008747    6.52  2.2e-10 ***
CompPrice           0.092937   0.004118   22.57  < 2e-16 ***
Income              0.010894   0.002604    4.18  3.6e-05 ***
Advertising         0.070246   0.022609    3.11  0.00203 **
Population          0.000159   0.000368    0.43  0.66533
Price              -0.100806   0.007440  -13.55  < 2e-16 ***
ShelveLocGood       4.848676   0.152838   31.72  < 2e-16 ***
ShelveLocMedium     1.953262   0.125768   15.53  < 2e-16 ***
Age                -0.057947   0.015951   -3.63  0.00032 ***
Education          -0.020852   0.019613   -1.06  0.28836
UrbanYes            0.140160   0.112402    1.25  0.21317
USYes              -0.157557   0.148923   -1.06  0.29073
Income:Advertising  0.000751   0.000278    2.70  0.00729 **
Price:Age           0.000107   0.000133    0.80  0.42381

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.01 on 386 degrees of freedom
Multiple R-squared: 0.876,    Adjusted R-squared: 0.872
F-statistic: 210 on 13 and 386 DF, p-value: <2e-16


The contrasts() function returns the coding that R uses for the dummy
variables.

> attach(Carseats)
> contrasts(ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1


Use ?contrasts to learn about other contrasts, and how to set them. 

R has created a ShelveLocGood dummy variable that takes on a value of 
1 if the shelving location is good, and 0 otherwise. It has also created a 
ShelveLocMedium dummy variable that equals 1 if the shelving location is 
medium, and 0 otherwise. A bad shelving location corresponds to a zero 
for each of the two dummy variables. The fact that the coefficient for 
ShelveLocGood in the regression output is positive indicates that a good
shelving location is associated with high sales (relative to a bad location). 
And ShelveLocMedium has a smaller positive coefficient, indicating that a 
medium shelving location leads to higher sales than a bad shelving location 
but lower sales than a good shelving location. 
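
The dummy coding can be inspected without the Carseats data. The sketch below uses a small made-up factor with the same three levels (the vector shelf is hypothetical, not taken from ISLR) and assumes R's default treatment contrasts are in effect.

```r
# Hypothetical toy factor mimicking the three ShelveLoc levels
shelf = factor(c("Bad", "Medium", "Good", "Medium", "Bad"))

contrasts(shelf)           # default coding, with "Bad" as the baseline
model.matrix(~ shelf)      # the dummy columns lm() would construct

# Choosing a different baseline changes which level is coded as all zeros
shelf2 = relevel(shelf, ref = "Medium")
colnames(contrasts(shelf2))
```

The baseline is the first factor level, which by default is the first in alphabetical order; relevel() is one way to change it when a different reference category makes the coefficients easier to interpret.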


3.6.7 Writing Functions

As we have seen, R comes with many useful functions, and still more
functions are available by way of R libraries. However, we will often be
interested in performing an operation for which no function is available. In this
setting, we may want to write our own function. For instance, below we
provide a simple function that reads in the ISLR and MASS libraries, called
LoadLibraries(). Before we have created the function, R returns an error if
we try to call it.

> LoadLibraries
Error: object 'LoadLibraries' not found
> LoadLibraries()
Error: could not find function "LoadLibraries"


We now create the function. Note that the + symbols are printed by R and 
should not be typed in. The { symbol informs R that multiple commands 
are about to be input. Hitting Enter after typing { will cause R to print the 
+ symbol. We can then input as many commands as we wish, hitting Enter 
after each one. Finally the } symbol informs R that no further commands 
will be entered. 

> LoadLibraries=function(){
+ library(ISLR)
+ library(MASS)
+ print("The libraries have been loaded.")
+ }

Now if we type in LoadLibraries, R will tell us what is in the function.

> LoadLibraries
function(){
library(ISLR)
library(MASS)
print("The libraries have been loaded.")
}

If we call the function, the libraries are loaded in and the print statement 
is output. 

> LoadLibraries()

[1] "The libraries have been loaded." 
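
Functions become more useful once they take arguments and return a value. As a hedged extension of the example above, the sketch below defines a hypothetical helper, LoadLibrariesIfPresent(), that accepts a vector of package names, attaches those that are installed, and returns which ones succeeded. The helper and its name are ours and do not appear in the lab; it relies only on base R's require().

```r
# Hypothetical helper (not part of the lab): attach several packages and
# report which ones were actually available
LoadLibrariesIfPresent = function(pkgs){
  ok = vapply(pkgs, function(p){
    suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
  }, logical(1))
  if(any(!ok)) print(paste("Not installed:", paste(pkgs[!ok], collapse = ", ")))
  invisible(ok)   # named logical vector, returned without printing
}

status = LoadLibrariesIfPresent(c("stats", "MASS"))
status
```

Returning the status vector rather than only printing a message lets the caller decide what to do when a package is missing.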
3.7 Exercises

Conceptual 

1. Describe the null hypotheses to which the p-values given in Table 3.4 
correspond. Explain what conclusions you can draw based on these 
p-values. Your explanation should be phrased in terms of sales, TV, 
radio, and newspaper, rather than in terms of the coefficients of the 
linear model. 

2. Carefully explain the differences between the KNN classifier and KNN 
regression methods. 

3. Suppose we have a data set with five predictors, $X_1$ = GPA, $X_2$ = IQ,
$X_3$ = Gender (1 for Female and 0 for Male), $X_4$ = Interaction between
GPA and IQ, and $X_5$ = Interaction between GPA and Gender. The
response is starting salary after graduation (in thousands of dollars).
Suppose we use least squares to fit the model, and get $\hat\beta_0 = 50$,
$\hat\beta_1 = 20$, $\hat\beta_2 = 0.07$, $\hat\beta_3 = 35$, $\hat\beta_4 = 0.01$,
$\hat\beta_5 = -10$.

(a) Which answer is correct, and why? 

i. For a fixed value of IQ and GPA, males earn more on average 
than females. 

ii. For a fixed value of IQ and GPA, females earn more on 
average than males. 

iii. For a fixed value of IQ and GPA, males earn more on average 
than females provided that the GPA is high enough. 

iv. For a fixed value of IQ and GPA, females earn more on 
average than males provided that the GPA is high enough. 

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0. 

(c) True or false: Since the coefficient for the GPA/IQ interaction 
term is very small, there is very little evidence of an interaction 
effect. Justify your answer. 

4. I collect a set of data (n = 100 observations) containing a single
predictor and a quantitative response. I then fit a linear regression
model to the data, as well as a separate cubic regression, i.e.
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$.

(a) Suppose that the true relationship between X and Y is linear,
i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of
squares (RSS) for the linear regression, and also the training
RSS for the cubic regression. Would we expect one to be lower
than the other, would we expect them to be the same, or is there
not enough information to tell? Justify your answer.

(b) Answer (a) using test rather than training RSS.

(c) Suppose that the true relationship between X and Y is not linear, 
but we don’t know how far it is from linear. Consider the training 
RSS for the linear regression, and also the training RSS for the 
cubic regression. Would we expect one to be lower than the 
other, would we expect them to be the same, or is there not 
enough information to tell? Justify your answer. 

(d) Answer (c) using test rather than training RSS. 

5. Consider the fitted values that result from performing linear regression
without an intercept. In this setting, the ith fitted value takes
the form

$$\hat{y}_i = x_i \hat\beta,$$

where

$$\hat\beta = \left(\sum_{i=1}^{n} x_i y_i\right) \Big/ \left(\sum_{i'=1}^{n} x_{i'}^2\right). \tag{3.38}$$

Show that we can write

$$\hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'}.$$

What is $a_{i'}$?

Note: We interpret this result by saying that the fitted values from
linear regression are linear combinations of the response values.

6. Using (3.4), argue that in the case of simple linear regression, the
least squares line always passes through the point $(\bar{x}, \bar{y})$.

7. It is claimed in the text that in the case of simple linear regression
of Y onto X, the $R^2$ statistic (3.17) is equal to the square of the
correlation between X and Y (3.18). Prove that this is the case. For
simplicity, you may assume that $\bar{x} = \bar{y} = 0$.


Applied 

8. This question involves the use of simple linear regression on the Auto 
data set. 

(a) Use the lm() function to perform a simple linear regression with
mpg as the response and horsepower as the predictor. Use the
summary() function to print the results. Comment on the output.
For example:

i. Is there a relationship between the predictor and the response?

ii. How strong is the relationship between the predictor and 
the response? 

iii. Is the relationship between the predictor and the response 
positive or negative? 

iv. What is the predicted mpg associated with a horsepower of 
98? What are the associated 95% confidence and prediction
intervals? 

(b) Plot the response and the predictor. Use the abline() function
to display the least squares regression line. 

(c) Use the plot() function to produce diagnostic plots of the least 
squares regression fit. Comment on any problems you see with 
the fit. 

9. This question involves the use of multiple linear regression on the 
Auto data set. 

(a) Produce a scatterplot matrix which includes all of the variables 
in the data set. 

(b) Compute the matrix of correlations between the variables using 
the function cor(). You will need to exclude the name variable, 
which is qualitative. 

(c) Use the lm() function to perform a multiple linear regression 
with mpg as the response and all other variables except name as 
the predictors. Use the summary() function to print the results.
Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant 
relationship to the response? 

iii. What does the coefficient for the year variable suggest? 

(d) Use the plot() function to produce diagnostic plots of the linear
regression fit. Comment on any problems you see with the fit. 
Do the residual plots suggest any unusually large outliers? Does 
the leverage plot identify any observations with unusually high 
leverage? 

(e) Use the * and : symbols to fit linear regression models with 
interaction effects. Do any interactions appear to be statistically 
significant? 

(f) Try a few different transformations of the variables, such as 
$\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.


10. This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, 
Urban, and US. 

(b) Provide an interpretation of each coefficient in the model. Be 
careful—some of the variables in the model are qualitative! 

(c) Write out the model in equation form, being careful to handle 
the qualitative variables properly. 

(d) For which of the predictors can you reject the null hypothesis
$H_0 : \beta_j = 0$?


(e) On the basis of your response to the previous question, fit a 
smaller model that only uses the predictors for which there is 
evidence of association with the outcome. 

(f) How well do the models in (a) and (e) fit the data? 

(g) Using the model from (e), obtain 95% confidence intervals for 
the coefficient(s).

(h) Is there evidence of outliers or high leverage observations in the 
model from (e)? 

11. In this problem we will investigate the t-statistic for the null
hypothesis $H_0 : \beta = 0$ in simple linear regression without an intercept. To
begin, we generate a predictor x and a response y as follows.

> set.seed(1)
> x=rnorm(100)
> y=2*x+rnorm(100)

(a) Perform a simple linear regression of y onto x, without an
intercept. Report the coefficient estimate $\hat\beta$, the standard error of
this coefficient estimate, and the t-statistic and p-value associated
with the null hypothesis $H_0 : \beta = 0$. Comment on these
results. (You can perform regression without an intercept using
the command lm(y~x+0).)

(b) Now perform a simple linear regression of x onto y without an
intercept, and report the coefficient estimate, its standard error,
and the corresponding t-statistic and p-values associated with
the null hypothesis $H_0 : \beta = 0$. Comment on these results.

(c) What is the relationship between the results obtained in (a) and
(b)?


(d) For the regression of Y onto X without an intercept, the
t-statistic for $H_0 : \beta = 0$ takes the form $\hat\beta / \mathrm{SE}(\hat\beta)$, where $\hat\beta$ is
given by (3.38), and where

$$\mathrm{SE}(\hat\beta) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i \hat\beta)^2}{(n-1) \sum_{i'=1}^{n} x_{i'}^2}}.$$

(These formulas are slightly different from those given in Sections
3.1.1 and 3.1.2, since here we are performing regression
without an intercept.) Show algebraically, and confirm numerically
in R, that the t-statistic can be written as

$$\frac{(\sqrt{n-1}) \sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i'=1}^{n} y_{i'}^2\right) - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}.$$

(e) Using the results from (d), argue that the t-statistic for the
regression of y onto x is the same as the t-statistic for the regression
of x onto y.

(f) In R, show that when regression is performed with an intercept,
the t-statistic for $H_0 : \beta_1 = 0$ is the same for the regression of y
onto x as it is for the regression of x onto y.


12. This problem involves simple linear regression without an intercept. 

(a) Recall that the coefficient estimate $\hat\beta$ for the linear regression of
Y onto X without an intercept is given by (3.38). Under what
circumstance is the coefficient estimate for the regression of X
onto Y the same as the coefficient estimate for the regression of
Y onto X?

(b) Generate an example in R with n = 100 observations in which 
the coefficient estimate for the regression of X onto Y is different 
from the coefficient estimate for the regression of Y onto X. 

(c) Generate an example in R with n = 100 observations in which 
the coefficient estimate for the regression of X onto Y is the 
same as the coefficient estimate for the regression of Y onto X. 

13. In this exercise you will create some simulated data and will fit simple 
linear regression models to it. Make sure to use set.seed(1) prior to
starting part (a) to ensure consistent results. 

(a) Using the rnorm() function, create a vector, x, containing 100
observations drawn from a $N(0, 1)$ distribution. This represents
a feature, X.

(b) Using the rnorm() function, create a vector, eps, containing 100
observations drawn from a $N(0, 0.25)$ distribution, i.e. a normal
distribution with mean zero and variance 0.25.

(c) Using x and eps, generate a vector y according to the model

$$Y = -1 + 0.5X + \epsilon. \tag{3.39}$$

What is the length of the vector y? What are the values of $\beta_0$
and $\beta_1$ in this linear model?
(d) Create a scatterplot displaying the relationship between x and
y. Comment on what you observe.

(e) Fit a least squares linear model to predict y using x. Comment
on the model obtained. How do $\hat\beta_0$ and $\hat\beta_1$ compare to $\beta_0$ and
$\beta_1$?

(f) Display the least squares line on the scatterplot obtained in (d). 
Draw the population regression line on the plot, in a different 
color. Use the legend() command to create an appropriate legend.

(g) Now fit a polynomial regression model that predicts y using x
and $x^2$. Is there evidence that the quadratic term improves the
model fit? Explain your answer. 

(h) Repeat (a)-(f) after modifying the data generation process in 
such a way that there is less noise in the data. The model (3.39) 
should remain the same. You can do this by decreasing the variance
of the normal distribution used to generate the error term
$\epsilon$ in (b). Describe your results.

(i) Repeat (a)-(f) after modifying the data generation process in 
such a way that there is more noise in the data. The model 
(3.39) should remain the same. You can do this by increasing 
the variance of the normal distribution used to generate the 
error term $\epsilon$ in (b). Describe your results.

(j) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the
original data set, the noisier data set, and the less noisy data 
set? Comment on your results. 

14. This problem focuses on the collinearity problem.

(a) Perform the following commands in R:

> set.seed(1)
> x1=runif(100)
> x2=0.5*x1+rnorm(100)/10
> y=2+2*x1+0.3*x2+rnorm(100)

The last line corresponds to creating a linear model in which y is
a function of x1 and x2. Write out the form of the linear model.
What are the regression coefficients?

(b) What is the correlation between x1 and x2? Create a scatterplot
displaying the relationship between the variables.

(c) Using this data, fit a least squares regression to predict y using
x1 and x2. Describe the results obtained. What are $\hat\beta_0$, $\hat\beta_1$, and
$\hat\beta_2$? How do these relate to the true $\beta_0$, $\beta_1$, and $\beta_2$? Can you
reject the null hypothesis $H_0 : \beta_1 = 0$? How about the null
hypothesis $H_0 : \beta_2 = 0$?




(d) Now fit a least squares regression to predict y using only x1.
Comment on your results. Can you reject the null hypothesis
$H_0 : \beta_1 = 0$?

(e) Now fit a least squares regression to predict y using only x2.
Comment on your results. Can you reject the null hypothesis
$H_0 : \beta_1 = 0$?

(f) Do the results obtained in (c)-(e) contradict each other? Explain 
your answer. 

(g) Now suppose we obtain one additional observation, which was 
unfortunately mismeasured. 

> x1=c(x1, 0.1)
> x2=c(x2, 0.8)
> y=c(y,6)

Re-fit the linear models from (c) to (e) using this new data. What
effect does this new observation have on each of the models?
In each model, is this observation an outlier? A high-leverage 
point? Both? Explain your answers. 

15. This problem involves the Boston data set, which we saw in the lab 
for this chapter. We will now try to predict per capita crime rate 
using the other variables in this data set. In other words, per capita 
crime rate is the response, and the other variables are the predictors. 

(a) For each predictor, fit a simple linear regression model to predict 
the response. Describe your results. In which of the models is 
there a statistically significant association between the predictor 
and the response? Create some plots to back up your assertions. 

(b) Fit a multiple regression model to predict the response using 
all of the predictors. Describe your results. For which predictors 
can we reject the null hypothesis $H_0 : \beta_j = 0$?

(c) How do your results from (a) compare to your results from (b)? 
Create a plot displaying the univariate regression coefficients 
from (a) on the x-axis, and the multiple regression coefficients 
from (b) on the y-axis. That is, each predictor is displayed as a
single point in the plot. Its coefficient in a simple linear regression
model is shown on the x-axis, and its coefficient estimate
in the multiple linear regression model is shown on the y-axis.

(d) Is there evidence of non-linear association between any of the 
predictors and the response? To answer this question, for each 
predictor X, fit a model of the form 


$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon.$$


4 

Classification 


The linear regression model discussed in Chapter 3 assumes that the
response variable Y is quantitative. But in many situations, the response
variable is instead qualitative. For example, eye color is qualitative, taking
on values blue, brown, or green. Often qualitative variables are referred
to as categorical; we will use these terms interchangeably. In this chapter,
we study approaches for predicting qualitative responses, a process that
is known as classification. Predicting a qualitative response for an
observation can be referred to as classifying that observation, since it involves
assigning the observation to a category, or class. On the other hand, often
the methods used for classification first predict the probability of each of
the categories of a qualitative variable, as the basis for making the
classification. In this sense they also behave like regression methods.

There are many possible classification techniques, or classifiers, that one
might use to predict a qualitative response. We touched on some of these
in Sections 2.1.5 and 2.2.3. In this chapter we discuss three of the most
widely-used classifiers: logistic regression, linear discriminant analysis, and
K-nearest neighbors. We discuss more computer-intensive methods in later
chapters, such as generalized additive models (Chapter 7), trees, random
forests, and boosting (Chapter 8), and support vector machines (Chapter 9).

4.1 An Overview of Classification

Classification problems occur often, perhaps even more so than regression 
problems. Some examples include: 

1. A person arrives at the emergency room with a set of symptoms 
that could possibly be attributed to one of three medical conditions. 
Which of the three conditions does the individual have? 

2. An online banking service must be able to determine whether or not 
a transaction being performed on the site is fraudulent, on the basis 
of the user’s IP address, past transaction history, and so forth. 

3. On the basis of DNA sequence data for a number of patients with 
and without a given disease, a biologist would like to figure out which 
DNA mutations are deleterious (disease-causing) and which are not. 

Just as in the regression setting, in the classification setting we have a
set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build
a classifier. We want our classifier to perform well not only on the training
data, but also on test observations that were not used to train the classifier. 

In this chapter, we will illustrate the concept of classification using the 
simulated Default data set. We are interested in predicting whether an 
individual will default on his or her credit card payment, on the basis of 
annual income and monthly credit card balance. The data set is displayed 
in Figure 4.1. We have plotted annual income and monthly credit card 
balance for a subset of 10,000 individuals. The left-hand panel of Figure 4.1 
displays individuals who defaulted in a given month in orange, and those 
who did not in blue. (The overall default rate is about 3%, so we have 
plotted only a fraction of the individuals who did not default.) It appears 
that individuals who defaulted tended to have higher credit card balances 
than those who did not. In the right-hand panel of Figure 4.1, two pairs 
of boxplots are shown. The first shows the distribution of balance split by 
the binary default variable; the second is a similar plot for income. In this 
chapter, we learn how to build a model to predict default (Y) for any 
given value of balance ($X_1$) and income ($X_2$). Since Y is not quantitative,
the simple linear regression model of Chapter 3 is not appropriate. 

It is worth noting that Figure 4.1 displays a very pronounced relationship
between the predictor balance and the response default. In most real
applications, the relationship between the predictor and the response will
not be nearly so strong. However, for the sake of illustrating the
classification procedures discussed in this chapter, we use an example in which the
relationship between the predictor and the response is somewhat exaggerated.
FIGURE 4.1. The Default data set. Left: The annual incomes and monthly
credit card balances of a number of individuals. The individuals who defaulted on 
their credit card payments are shown in orange, and those who did not are shown 
in blue. Center: Boxplots of balance as a function of default status. Right: 
Boxplots of income as a function of default status. 

4.2 Why Not Linear Regression? 


We have stated that linear regression is not appropriate in the case of a 
qualitative response. Why not? 

Suppose that we are trying to predict the medical condition of a patient 
in the emergency room on the basis of her symptoms. In this simplified 
example, there are three possible diagnoses: stroke, drug overdose, and 
epileptic seizure. We could consider encoding these values as a
quantitative response variable, \(Y\), as follows:

\[
Y = \begin{cases}
1 & \text{if stroke;} \\
2 & \text{if drug overdose;} \\
3 & \text{if epileptic seizure.}
\end{cases}
\]

Using this coding, least squares could be used to fit a linear regression model 
to predict \(Y\) on the basis of a set of predictors \(X_1, \ldots, X_p\). Unfortunately,
this coding implies an ordering on the outcomes, putting drug overdose in 
between stroke and epileptic seizure, and insisting that the difference 
between stroke and drug overdose is the same as the difference between 
drug overdose and epileptic seizure. In practice there is no particular 
reason that this needs to be the case. For instance, one could choose an 
equally reasonable coding, 

\[
Y = \begin{cases}
1 & \text{if epileptic seizure;} \\
2 & \text{if stroke;} \\
3 & \text{if drug overdose,}
\end{cases}
\]

which would imply a totally different relationship among the three
conditions. Each of these codings would produce fundamentally different linear
models that would ultimately lead to different sets of predictions on test 
observations. 

If the response variable’s values did take on a natural ordering, such as 
mild, moderate, and severe, and we felt the gap between mild and moderate
was similar to the gap between moderate and severe, then a 1, 2, 3 coding 
would be reasonable. Unfortunately, in general there is no natural way to 
convert a qualitative response variable with more than two levels into a 
quantitative response that is ready for linear regression. 

For a binary (two level) qualitative response, the situation is better. For 
instance, perhaps there are only two possibilities for the patient’s
medical condition: stroke and drug overdose. We could then potentially use
the dummy variable approach from Section 3.3.1 to code the response as
follows:

\[
Y = \begin{cases}
0 & \text{if stroke;} \\
1 & \text{if drug overdose.}
\end{cases}
\]

We could then fit a linear regression to this binary response, and predict 
drug overdose if \(\hat{Y} > 0.5\) and stroke otherwise. In the binary case it is not
hard to show that even if we flip the above coding, linear regression will 
produce the same final predictions. 
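
The claim that flipping the 0/1 coding leaves the final predictions unchanged can be checked numerically. The following Python sketch is only an illustration: the data are simulated (an assumption, not data from the text), and the names `fitted_values`, `fit_original`, and `fit_flipped` are hypothetical. It fits least squares under both codings and compares the thresholded predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: an assumption for illustration, not data from the text.
n = 200
x = rng.normal(size=n)
# 1 = drug overdose, 0 = stroke
overdose = (x + rng.normal(scale=0.8, size=n) > 0).astype(float)

X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column


def fitted_values(X, y):
    """Return the least squares fitted values X @ beta_hat."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat


fit_original = fitted_values(X, overdose)        # coding: 1 = drug overdose
fit_flipped = fitted_values(X, 1.0 - overdose)   # coding: 1 = stroke

# With an intercept in the model, flipping the coding maps each fitted
# value f to 1 - f (up to floating-point error).
max_gap = float(np.max(np.abs(fit_flipped - (1.0 - fit_original))))

# Predict drug overdose when the fitted value for "overdose" exceeds 0.5,
# or equivalently when the fitted value for "stroke" falls below 0.5.
pred_original = fit_original > 0.5
pred_flipped = fit_flipped < 0.5
same_predictions = bool(np.array_equal(pred_original, pred_flipped))

print(max_gap, same_predictions)
```

The argument relies on the intercept: because the constant column is in the model, regressing on 1 - Y simply replaces every fitted value f by 1 - f, so the 0.5 threshold selects the same observations (ignoring exact ties at 0.5).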

For a binary response with a 0/1 coding as above, regression by least 
squares does make sense; it can be shown that the \(X\hat{\beta}\) obtained using linear
regression is in fact an estimate of Pr(drug overdose | X) in this special 
case. However, if we use linear regression, some of our estimates might be 
outside the [0,1] interval (see Figure 4.2), making them hard to interpret 
as probabilities! Nevertheless, the predictions provide an ordering and can 
be interpreted as crude probability estimates. Curiously, it turns out that 
the classifications that we get if we use linear regression to predict a binary 
response will be the same as for the linear discriminant analysis (LDA) 
procedure we discuss in Section 4.4. 

However, the dummy variable approach cannot be easily extended to 
accommodate qualitative responses with more than two levels. For these 
reasons, it is preferable to use a classification method that is truly suited 
for qualitative response values, such as the ones presented next. 


4.3 Logistic Regression 

Consider again the Default data set, where the response default falls into 
one of two categories, Yes or No. Rather than modeling this response Y 
directly, logistic regression models the probability that Y belongs to a
particular category.


FIGURE 4.2. Classification using the Default data. Left: Estimated
probability of default using linear regression. Some estimated probabilities
are negative! The orange ticks indicate the 0/1 values coded for default (No
or Yes). Right: Predicted probabilities of default using logistic
regression. All probabilities lie between 0 and 1.

For the Default data, logistic regression models the probability of default. 
For example, the probability of default given balance can be written as 

Pr(default = Yes|balance). 

The values of Pr(default = Yes|balance), which we abbreviate 
p(balance), will range between 0 and 1. Then for any given value of balance, 
a prediction can be made for default. For example, one might predict 
default = Yes for any individual for whom p(balance) > 0.5. Alternatively,
if a company wishes to be conservative in predicting individuals who
are at risk for default, then they may choose to use a lower threshold, such 
as p(balance) > 0.1. 


4.3.1 The Logistic Model

How should we model the relationship between \(p(X) = \Pr(Y = 1 \mid X)\) and
\(X\)? (For convenience we are using the generic 0/1 coding for the response.)
In Section 4.2 we talked of using a linear regression model to represent
these probabilities:

\[
p(X) = \beta_0 + \beta_1 X. \tag{4.1}
\]

If we use this approach to predict default=Yes using balance, then we 
obtain the model shown in the left-hand panel of Figure 4.2. Here we see 
the problem with this approach: for balances close to zero we predict a 
negative probability of default; if we were to predict for very large balances, 
we would get values bigger than 1. These predictions are not sensible, since 
of course the true probability of default, regardless of credit card balance, 
must fall between 0 and 1. This problem is not unique to the credit default 
data. Any time a straight line is fit to a binary response that is coded as 
0 or 1, in principle we can always predict \(p(X) < 0\) for some values of \(X\)
and \(p(X) > 1\) for others (unless the range of \(X\) is limited).

To avoid this problem, we must model p(X) using a function that gives 
outputs between 0 and 1 for all values of X. Many functions meet this 
description. In logistic regression, we use the logistic function,

\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \tag{4.2}
\]

To fit the model (4.2), we use a method called maximum likelihood, which
we discuss in the next section. The right-hand panel of Figure 4.2 illustrates 
the fit of the logistic regression model to the Default data. Notice that for 
low balances we now predict the probability of default as close to, but never 
below, zero. Likewise, for high balances we predict a default probability 
close to, but never above, one. The logistic function will always produce 
an S-shaped curve of this form, and so regardless of the value of \(X\), we
will obtain a sensible prediction. We also see that the logistic model is 
better able to capture the range of probabilities than is the linear regression 
model in the left-hand plot. The average fitted probability in both cases is 
0.0333 (averaged over the training data), which is the same as the overall 
proportion of defaulters in the data set. 

After a bit of manipulation of (4.2), we find that 


\[
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}. \tag{4.3}
\]


The quantity \(p(X)/[1 - p(X)]\) is called the odds, and can take on any value
between 0 and \(\infty\). Values of the odds close to 0 and \(\infty\) indicate very low
and very high probabilities of default, respectively. For example, on average
1 in 5 people with an odds of 1/4 will default, since \(p(X) = 0.2\) implies an
odds of \(\frac{0.2}{1 - 0.2} = 1/4\). Likewise on average nine out of every ten people
with an odds of 9 will default, since \(p(X) = 0.9\) implies an odds of
\(\frac{0.9}{1 - 0.9} = 9\).
Odds are traditionally used instead of probabilities in horse-racing, since 
they relate more naturally to the correct betting strategy. 
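
The two odds calculations above can be reproduced with a few lines of Python. This is a minimal sketch; the helper names `odds` and `prob_from_odds` are hypothetical and chosen only for this illustration.

```python
def odds(p):
    """Convert a probability p (with 0 <= p < 1) into odds p / (1 - p)."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Invert the odds transformation: p = o / (1 + o)."""
    return o / (1.0 + o)


odds_low = odds(0.2)    # the text's first example: odds of 1/4
odds_high = odds(0.9)   # the text's second example: odds of 9

print(round(odds_low, 6), round(odds_high, 6))
print(round(prob_from_odds(9.0), 6))  # back to a probability of 0.9
```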

By taking the logarithm of both sides of (4.3), we arrive at 


\[
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X. \tag{4.4}
\]

The left-hand side is called the log-odds or logit. We see that the logistic 
regression model (4.2) has a logit that is linear in X. 

Recall from Chapter 3 that in a linear regression model, \(\beta_1\) gives the
average change in \(Y\) associated with a one-unit increase in \(X\). In contrast,
in a logistic regression model, increasing \(X\) by one unit changes the log odds
by \(\beta_1\) (4.4), or equivalently it multiplies the odds by \(e^{\beta_1}\) (4.3). However,
because the relationship between \(p(X)\) and \(X\) in (4.2) is not a straight line,
\(\beta_1\) does not correspond to the change in \(p(X)\) associated with a one-unit
increase in \(X\). The amount that \(p(X)\) changes due to a one-unit change in
\(X\) will depend on the current value of \(X\). But regardless of the value of \(X\),
if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\),
and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing
\(p(X)\). The fact that there is not a straight-line relationship between \(p(X)\)
and \(X\), and the fact that the rate of change in \(p(X)\) per unit change in \(X\)
depends on the current value of \(X\), can also be seen by inspection of the
right-hand panel of Figure 4.2.
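
To make the distinction concrete, the sketch below evaluates the logistic function (4.2) at a few balances, using the coefficient estimates reported in Table 4.1. The function names `p` and `odds` and the chosen balances are assumptions made for illustration. The odds ratio for a one-unit increase is the same at every balance, while the change in probability is not.

```python
import math

# Coefficient estimates taken from Table 4.1 (balance as the only predictor).
beta0, beta1 = -10.6513, 0.0055


def p(x):
    """Logistic function (4.2) evaluated at balance x."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))


def odds(x):
    """Odds p(x) / (1 - p(x)), the left-hand side of (4.3)."""
    px = p(x)
    return px / (1.0 - px)


for balance in (500.0, 1500.0, 2500.0):
    ratio = odds(balance + 1.0) / odds(balance)    # constant: exp(beta1)
    delta_p = p(balance + 1.0) - p(balance)        # varies with balance
    print(f"balance={balance:6.0f}  odds ratio={ratio:.6f}  change in p={delta_p:.3e}")
```

The odds ratio printed on every row agrees with exp(0.0055), whereas the change in probability is much larger near the middle of the S-shaped curve than in its flat lower tail.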

4.3.2 Estimating the Regression Coefficients

The coefficients \(\beta_0\) and \(\beta_1\) in (4.2) are unknown, and must be estimated
based on the available training data. In Chapter 3, we used the least squares
approach to estimate the unknown linear regression coefficients. Although
we could use (non-linear) least squares to fit the model (4.4), the more
general method of maximum likelihood is preferred, since it has better
statistical properties. The basic intuition behind using maximum likelihood
to fit a logistic regression model is as follows: we seek estimates for \(\beta_0\) and
\(\beta_1\) such that the predicted probability \(\hat{p}(x_i)\) of default for each individual,
using (4.2), corresponds as closely as possible to the individual’s observed
default status. In other words, we try to find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that plugging
these estimates into the model for \(p(X)\), given in (4.2), yields a number
close to one for all individuals who defaulted, and a number close to zero
for all individuals who did not. This intuition can be formalized using a
mathematical equation called a likelihood function:

\[
\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr). \tag{4.5}
\]

The estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are chosen to maximize this likelihood function.
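
For readers who want to see the maximization carried out, here is a minimal Python sketch. It is not the procedure used by any particular software package: it maximizes the logarithm of the likelihood (4.5) with plain Newton-Raphson steps, which is one common choice. The data are simulated, and `true_beta`, `log_likelihood`, and `fit_logistic` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data: an assumption for illustration, not the Default data.
n = 2000
x = rng.normal(size=n)
true_beta = np.array([-1.0, 2.0])        # hypothetical "true" coefficients
X = np.column_stack([np.ones(n), x])     # intercept column plus one predictor
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)


def log_likelihood(beta, X, y):
    """Log of (4.5): sum of log p(x_i) over y_i = 1 and log(1 - p(x_i)) over y_i = 0."""
    z = X @ beta
    # log p = z - log(1 + e^z) and log(1 - p) = -log(1 + e^z)
    return float(np.sum(y * z - np.logaddexp(0.0, z)))


def fit_logistic(X, y, n_iter=25):
    """Maximize the log-likelihood with Newton-Raphson updates, starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta


beta_hat = fit_logistic(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```

Working with the log of (4.5) turns the product into a sum, which is numerically better behaved and has the same maximizer.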

Maximum likelihood is a very general approach that is used to fit many 
of the non-linear models that we examine throughout this book. In the 
linear regression setting, the least squares approach is in fact a special case 
of maximum likelihood. The mathematical details of maximum likelihood 
are beyond the scope of this book. However, in general, logistic regression 
and other models can be easily fit using a statistical software package such 
as R, and so we do not need to concern ourselves with the details of the 
maximum likelihood fitting procedure. 

Table 4.1 shows the coefficient estimates and related information that 
result from fitting a logistic regression model on the Default data in order 
to predict the probability of default=Yes using balance. We see that
\(\hat{\beta}_1 = 0.0055\); this indicates that an increase in balance is associated with an
increase in the probability of default. To be precise, a one-unit increase in 
balance is associated with an increase in the log odds of default by 0.0055 
units. 


               Coefficient   Std. error   Z-statistic   P-value
Intercept         -10.6513       0.3612         -29.5   <0.0001
balance             0.0055       0.0002          24.9   <0.0001


TABLE 4.1. For the Default data, estimated coefficients of the logistic
regression model that predicts the probability of default using balance. A one-unit
increase in balance is associated with an increase in the log odds of default by 
0.0055 units. 


Many aspects of the logistic regression output shown in Table 4.1 are
similar to the linear regression output of Chapter 3. For example, we can
measure the accuracy of the coefficient estimates by computing their
standard errors. The z-statistic in Table 4.1 plays the same role as the
t-statistic in the linear regression output, for example in Table 3.1 on
page 68. For instance, the z-statistic associated with \(\beta_1\) is equal to
\(\hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)\), and so a large (absolute) value of the z-statistic
indicates evidence against the null hypothesis \(H_0 : \beta_1 = 0\). This null
hypothesis implies that \(p(X) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}\); in other words, that the
probability of default does not depend on balance. Since the p-value
associated with balance in Table 4.1 is tiny, we can reject \(H_0\). In other
words, we conclude that there is indeed an association between balance and
probability of default. The estimated intercept in Table 4.1 is typically
not of interest; its main purpose is to adjust the average fitted
probabilities to the proportion of ones in the data.


4.3.3 Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute 
the probability of default for any given credit card balance. For example, 
using the coefficient estimates given in Table 4.1, we predict that the default 
probability for an individual with a balance of $1,000 is 

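\[
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}}
= \frac{e^{-10.6513 + 0.0055 \times 1{,}000}}{1 + e^{-10.6513 + 0.0055 \times 1{,}000}} = 0.00576,
\]

which is below 1%. In contrast, the predicted probability of default for an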
individual with a balance of $2,000 is much higher, and equals 0.586 or 
58.6%. 
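
Both predictions can be reproduced directly from the coefficient estimates in Table 4.1. The Python sketch below is only an illustration of the arithmetic; the function name `predict_default_probability` is hypothetical.

```python
import math

# Coefficient estimates from Table 4.1.
beta0_hat, beta1_hat = -10.6513, 0.0055


def predict_default_probability(balance):
    """Plug a balance (in dollars) into the fitted logistic function (4.2)."""
    z = beta0_hat + beta1_hat * balance
    return math.exp(z) / (1.0 + math.exp(z))


p_1000 = predict_default_probability(1000)
p_2000 = predict_default_probability(2000)

print(f"{p_1000:.5f}")  # 0.00576
print(f"{p_2000:.3f}")  # 0.586
```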

One can use qualitative predictors with the logistic regression model 
using the dummy variable approach from Section 3.3.1. As an example, 
the Default data set contains the qualitative variable student. To fit the 
model we simply create a dummy variable that takes on a value of 1 for 
students and 0 for non-students. The logistic regression model that results 
from predicting probability of default from student status can be seen in 
Table 4.2. The coefficient associated with the dummy variable is positive, 

               Coefficient   Std. error   Z-statistic   P-value
Intercept          -3.5041       0.0707        -49.55   <0.0001
student[Yes]        0.4049       0.1150          3.52    0.0004


TABLE 4.2. For the Default data, estimated coefficients of the logistic
regression model that predicts the probability of default using student status.
Student status is encoded as a dummy variable, with a value of 1 for a student
and a value of 0 for a non-student, and represented by the variable
student[Yes] in the table.


and the associated p-value is statistically significant. This indicates that 
students tend to have higher default probabilities than non-students: 


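\[
\begin{aligned}
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{Yes})
  &= \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431, \\
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{No})
  &= \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292.
\end{aligned}
\]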


4.3.4 Multiple Logistic Regression

We now consider the problem of predicting a binary response using multiple 
predictors. By analogy with the extension from simple to multiple linear 
regression in Chapter 3, we can generalize (4.4) as follows: 

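\[
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \tag{4.6}
\]

where \(X = (X_1, \ldots, X_p)\) are \(p\) predictors. Equation 4.6 can be rewritten as

\[
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \tag{4.7}
\]

Just as in Section 4.3.2, we use the maximum likelihood method to estimate
\(\beta_0, \beta_1, \ldots, \beta_p\).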

Table 4.3 shows the coefficient estimates for a logistic regression model 
that uses balance, income (in thousands of dollars), and student status to 
predict probability of default. There is a surprising result here. The
p-values associated with balance and the dummy variable for student status
are very small, indicating that each of these variables is associated with
the probability of default. However, the coefficient for the dummy variable
is negative, indicating that students are less likely to default than
non-students. In contrast, the coefficient for the dummy variable is positive in
Table 4.2. How is it possible for student status to be associated with an
increase in probability of default in Table 4.2 and a decrease in probability
of default in Table 4.3? The left-hand panel of Figure 4.3 provides a
graphical illustration of this apparent paradox. The orange and blue solid lines
show the average default rates for students and non-students, respectively, 

               Coefficient   Std. error   Z-statistic   P-value
Intercept         -10.8690       0.4923        -22.08   <0.0001
balance             0.0057       0.0002         24.74   <0.0001
income              0.0030       0.0082          0.37    0.7115
student[Yes]       -0.6468       0.2362         -2.74    0.0062


TABLE 4.3. For the Default data, estimated coefficients of the logistic
regression model that predicts the probability of default using balance, income,
and student status. Student status is encoded as a dummy variable student[Yes],
with a value of 1 for a student and a value of 0 for a non-student. In fitting this 
model, income was measured in thousands of dollars. 


as a function of credit card balance. The negative coefficient for student in 
the multiple logistic regression indicates that for a fixed value of balance 
and income, a student is less likely to default than a non-student. Indeed, 
we observe from the left-hand panel of Figure 4.3 that the student default 
rate is at or below that of the non-student default rate for every value of 
balance. But the horizontal broken lines near the base of the plot, which 
show the default rates for students and non-students averaged over all
values of balance and income, suggest the opposite effect: the overall student
default rate is higher than the non-student default rate. Consequently, there 
is a positive coefficient for student in the single variable logistic regression 
output shown in Table 4.2. 

The right-hand panel of Figure 4.3 provides an explanation for this
discrepancy. The variables student and balance are correlated. Students tend
to hold higher levels of debt, which is in turn associated with higher
probability of default. In other words, students are more likely to have large
credit card balances, which, as we know from the left-hand panel of
Figure 4.3, tend to be associated with high default rates. Thus, even though
an individual student with a given credit card balance will tend to have a
lower probability of default than a non-student with the same credit card
balance, the fact that students on the whole tend to have higher credit card
balances means that overall, students tend to default at a higher rate than
non-students. This is an important distinction for a credit card company
that is trying to determine to whom they should offer credit. A student is
riskier than a non-student if no information about the student’s credit card
balance is available. However, that student is less risky than a non-student
with the same credit card balance!

This simple example illustrates the dangers and subtleties associated 
with performing regressions involving only a single predictor when other 
predictors may also be relevant. As in the linear regression setting, the 
results obtained using one predictor may be quite different from those
obtained using multiple predictors, especially when there is correlation among
the predictors. In general, the phenomenon seen in Figure 4.3 is known as 
confounding. 



FIGURE 4.3. Confounding in the Default data. Left: Default rates are shown 
for students (orange) and non-students (blue). The solid lines display default rate 
as a function of balance, while the horizontal broken lines display the overall 
default rates. Right: Boxplots of balance for students (orange) and non-students 
(blue) are shown. 


By substituting estimates for the regression coefficients from Table 4.3
into (4.7), we can make predictions. For example, a student with a credit
card balance of $1,500 and an income of $40,000 has an estimated
probability of default of

\[
\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}} = 0.058. \tag{4.8}
\]

A non-student with the same balance and income has an estimated
probability of default of

\[
\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}} = 0.105. \tag{4.9}
\]

(Here we multiply the income coefficient estimate from Table 4.3 by 40,
rather than by 40,000, because in that table the model was fit with income
measured in units of $1,000.)
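
The arithmetic in (4.8) and (4.9), including the rescaling of income to thousands of dollars, can be sketched in Python as follows. The coefficient values are the ones that appear in (4.8); the dictionary `coef` and the function `predict` are hypothetical names used only for this illustration.

```python
import math

# Coefficient estimates as used in (4.8) and (4.9); income is in thousands of dollars.
coef = {"intercept": -10.869, "balance": 0.00574, "income": 0.003, "student": -0.6468}


def predict(balance, income_dollars, is_student):
    """Evaluate the fitted multiple logistic regression model (4.7)."""
    z = (
        coef["intercept"]
        + coef["balance"] * balance
        + coef["income"] * (income_dollars / 1000.0)  # convert dollars to thousands
        + coef["student"] * (1 if is_student else 0)  # dummy variable student[Yes]
    )
    return 1.0 / (1.0 + math.exp(-z))


p_student = predict(1500, 40_000, True)
p_non_student = predict(1500, 40_000, False)

print(f"{p_student:.3f}")      # 0.058
print(f"{p_non_student:.3f}")  # 0.105
```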


4.3.5 Logistic Regression for >2 Response Classes

We sometimes wish to classify a response variable that has more than two
classes. For example, in Section 4.2 we had three categories of medical
condition in the emergency room: stroke, drug overdose, epileptic seizure.
In this setting, we wish to model both \(\Pr(Y = \text{stroke} \mid X)\) and
\(\Pr(Y = \text{drug overdose} \mid X)\), with the remaining
\(\Pr(Y = \text{epileptic seizure} \mid X) = 1 - \Pr(Y = \text{stroke} \mid X) - \Pr(Y = \text{drug overdose} \mid X)\).
The two-class logistic regression models discussed in the previous sections
have multiple-class extensions, but in practice they tend not to be used all
that often. One of the reasons is that the method we discuss in the next
section, discriminant analysis, is popular for multiple-class classification.
So we do not go into the details of multiple-class logistic regression here,
but simply note that such an approach is possible, and that software for it
is available in R.


4.4 Linear Discriminant Analysis 

Logistic regression involves directly modeling \(\Pr(Y = k \mid X = x)\) using the
logistic function, given by (4.7) for the case of two response classes. In
statistical jargon, we model the conditional distribution of the response \(Y\),
given the predictor(s) \(X\). We now consider an alternative and less direct
approach to estimating these probabilities. In this alternative approach,
we model the distribution of the predictors \(X\) separately in each of the
response classes (i.e. given \(Y\)), and then use Bayes’ theorem to flip these
around into estimates for \(\Pr(Y = k \mid X = x)\). When these distributions are
assumed to be normal, it turns out that the model is very similar in form
to logistic regression.

Why do we need another method, when we have logistic regression? 
There are several reasons: 

• When the classes are well-separated, the parameter estimates for the 
logistic regression model are surprisingly unstable. Linear
discriminant analysis does not suffer from this problem.

• If n is small and the distribution of the predictors X is approximately 
normal in each of the classes, the linear discriminant model is again 
more stable than the logistic regression model. 

• As mentioned in Section 4.3.5, linear discriminant analysis is popular 
when we have more than two response classes. 


4.4.1 Using Bayes’ Theorem for Classification

Suppose that we wish to classify an observation into one of \(K\) classes, where
\(K \geq 2\). In other words, the qualitative response variable \(Y\) can take on \(K\)
possible distinct and unordered values. Let \(\pi_k\) represent the overall or prior
probability that a randomly chosen observation comes from the \(k\)th class;
this is the probability that a given observation is associated with the \(k\)th
category of the response variable \(Y\). Let \(f_k(X) = \Pr(X = x \mid Y = k)\) denote
the density function of \(X\) for an observation that comes from the \(k\)th class.
In other words, \(f_k(x)\) is relatively large if there is a high probability that
an observation in the \(k\)th class has \(X \approx x\), and \(f_k(x)\) is small if it is very
unlikely that an observation in the \(k\)th class has \(X \approx x\). Then Bayes’
theorem states that

\[
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}. \tag{4.10}
\]

In accordance with our earlier notation, we will use the abbreviation
\(p_k(X) = \Pr(Y = k \mid X)\). This suggests that instead of directly computing \(p_k(X)\)
as in Section 4.3.1, we can simply plug in estimates of \(\pi_k\) and \(f_k(X)\) into
(4.10). In general, estimating \(\pi_k\) is easy if we have a random sample of
\(Y\)s from the population: we simply compute the fraction of the training
observations that belong to the \(k\)th class. However, estimating \(f_k(X)\) tends
to be more challenging, unless we assume some simple forms for these
densities. We refer to \(p_k(x)\) as the posterior probability that an observation
\(X = x\) belongs to the \(k\)th class. That is, it is the probability that the
observation belongs to the \(k\)th class, given the predictor value for that
observation.

We know from Chapter 2 that the Bayes classifier, which classifies an
observation to the class for which \(p_k(X)\) is largest, has the lowest possible
error rate out of all classifiers. (This is of course only true if the terms
in (4.10) are all correctly specified.) Therefore, if we can find a way to
estimate \(f_k(X)\), then we can develop a classifier that approximates the
Bayes classifier. Such an approach is the topic of the following sections.
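
As a small illustration of how (4.10) turns priors and class densities into posterior probabilities, consider the Python sketch below. The three-class setup, the prior values, and the use of normal densities are all assumptions made purely for this example; `normal_pdf` and `posterior` are hypothetical helper names.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, priors, densities):
    """Posterior probabilities Pr(Y = k | X = x) computed from (4.10)."""
    joint = [pi_k * f_k(x) for pi_k, f_k in zip(priors, densities)]
    total = sum(joint)  # denominator of (4.10)
    return [j / total for j in joint]


# Hypothetical three-class problem: priors and class densities are made up.
priors = [0.5, 0.3, 0.2]
densities = [
    lambda x: normal_pdf(x, -1.0, 1.0),
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 2.0, 1.0),
]

post = posterior(0.5, priors, densities)
# Assign the observation to the class with the largest posterior (0-based index).
predicted_class = max(range(len(post)), key=post.__getitem__)

print([round(v, 3) for v in post], predicted_class)
```

Note that the class with the largest prior is not necessarily the one chosen: at x = 0.5 the second class has a smaller prior than the first but a much larger density, and it wins.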

4.4.2 Linear Discriminant Analysis for p = 1

For now, assume that \(p = 1\), that is, we have only one predictor. We
would like to obtain an estimate for \(f_k(x)\) that we can plug into (4.10) in
order to estimate \(p_k(x)\). We will then classify an observation to the class
for which \(p_k(x)\) is greatest. In order to estimate \(f_k(x)\), we will first make
some assumptions about its form.

Suppose we assume that \(f_k(x)\) is normal or Gaussian. In the
one-dimensional setting, the normal density takes the form

\[
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right), \tag{4.11}
\]

where \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters for the \(k\)th class.
For now, let us further assume that \(\sigma_1^2 = \cdots = \sigma_K^2\): that is, there is a shared
variance term across all \(K\) classes, which for simplicity we can denote by
\(\sigma^2\). Plugging (4.11) into (4.10), we find that

\[
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}
{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}. \tag{4.12}
\]

(Note that in (4.12), \(\pi_k\) denotes the prior probability that an observation
belongs to the \(k\)th class, not to be confused with \(\pi \approx 3.14159\), the
mathematical constant.) The Bayes classifier involves assigning an observation

FIGURE 4.4. Left: Two one-dimensional normal density functions are shown.
The dashed vertical line represents the Bayes decision boundary. Right: 20
observations were drawn from each of the two classes, and are shown as histograms.
The Bayes decision boundary is again shown as a dashed vertical line. The solid 
vertical line represents the LDA decision boundary estimated from the training 
data. 


\(X = x\) to the class for which (4.12) is largest. Taking the log of (4.12)
and rearranging the terms, it is not hard to show that this is equivalent to
assigning the observation to the class for which

\[
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \tag{4.13}
\]

is largest. For instance, if \(K = 2\) and \(\pi_1 = \pi_2\), then the Bayes classifier
assigns an observation to class 1 if \(2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2\), and to class
2 otherwise. In this case, the Bayes decision boundary corresponds to the
point where

\[
x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}. \tag{4.14}
\]

An example is shown in the left-hand panel of Figure 4.4. The two normal
density functions that are displayed, \(f_1(x)\) and \(f_2(x)\), represent two distinct
classes. The mean and variance parameters for the two density functions
are \(\mu_1 = -1.25\), \(\mu_2 = 1.25\), and \(\sigma_1^2 = \sigma_2^2 = 1\). The two densities overlap,
and so given that \(X = x\), there is some uncertainty about the class to which
the observation belongs. If we assume that an observation is equally likely
to come from either class (that is, \(\pi_1 = \pi_2 = 0.5\)), then by inspection of
(4.14), we see that the Bayes classifier assigns the observation to class 1
if \(x < 0\) and class 2 otherwise. Note that in this case, we can compute
the Bayes classifier because we know that \(X\) is drawn from a Gaussian
distribution within each class, and we know all of the parameters involved.
In a real-life situation, we are not able to calculate the Bayes classifier.
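
Because every parameter is known in this example, the Bayes classifier and its error rate can be computed exactly. The Python sketch below evaluates the discriminant function (4.13) and the boundary (4.14) for the parameters stated above; the helper names `delta`, `bayes_class`, and `normal_cdf` are hypothetical. The resulting error rate is about 0.106, consistent with the 10.6% Bayes error rate quoted later in this section.

```python
import math

# Parameters of the two classes shown in the left-hand panel of Figure 4.4.
mu1, mu2, sigma = -1.25, 1.25, 1.0
pi1 = pi2 = 0.5


def delta(x, mu, sigma, prior):
    """Discriminant function (4.13) for a single class."""
    return x * mu / sigma**2 - mu**2 / (2.0 * sigma**2) + math.log(prior)


def bayes_class(x):
    """Assign x to the class with the larger discriminant value."""
    return 1 if delta(x, mu1, sigma, pi1) > delta(x, mu2, sigma, pi2) else 2


boundary = (mu1 + mu2) / 2.0  # Bayes decision boundary from (4.14)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Bayes error rate: a class 1 observation falls to the right of the boundary,
# or a class 2 observation falls to the left of it.
bayes_error = (
    pi1 * (1.0 - normal_cdf((boundary - mu1) / sigma))
    + pi2 * normal_cdf((boundary - mu2) / sigma)
)

print(boundary, bayes_class(-0.1), bayes_class(0.1), round(bayes_error, 4))
```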

In practice, even if we are quite certain of our assumption that \(X\) is drawn
from a Gaussian distribution within each class, we still have to estimate
the parameters \(\mu_1, \ldots, \mu_K\), \(\pi_1, \ldots, \pi_K\), and \(\sigma^2\). The linear discriminant
analysis (LDA) method approximates the Bayes classifier by plugging
estimates for \(\pi_k\), \(\mu_k\), and \(\sigma^2\) into (4.13). In particular, the following estimates
are used:

\[
\begin{aligned}
\hat{\mu}_k &= \frac{1}{n_k} \sum_{i : y_i = k} x_i, \\
\hat{\sigma}^2 &= \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2,
\end{aligned}
\tag{4.15}
\]


where \(n\) is the total number of training observations, and \(n_k\) is the number
of training observations in the \(k\)th class. The estimate for \(\mu_k\) is simply the
average of all the training observations from the \(k\)th class, while \(\hat{\sigma}^2\) can
be seen as a weighted average of the sample variances for each of the \(K\)
classes. Sometimes we have knowledge of the class membership
probabilities \(\pi_1, \ldots, \pi_K\), which can be used directly. In the absence of any additional
information, LDA estimates \(\pi_k\) using the proportion of the training
observations that belong to the \(k\)th class. In other words,

\[
\hat{\pi}_k = n_k / n. \tag{4.16}
\]

The LDA classifier plugs the estimates given in (4.15) and (4.16) into (4.13),
and assigns an observation \(X = x\) to the class for which

\[
\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k) \tag{4.17}
\]

is largest. The word linear in the classifier’s name stems from the fact
that the discriminant functions \(\hat{\delta}_k(x)\) in (4.17) are linear functions of \(x\) (as
opposed to a more complex function of \(x\)).
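
The plug-in steps (4.15) to (4.17) translate almost line by line into code. The following Python sketch assumes a single predictor and simulated training data drawn with the same class means and variance as the example in Figure 4.4; it is not the book's actual sample, and `lda_fit` and `lda_predict` are hypothetical names.

```python
import numpy as np


def lda_fit(x, y):
    """Compute the estimates (4.15) and (4.16) for a single predictor."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    mu_hat = np.array([x[y == k].mean() for k in classes])   # class means
    pi_hat = np.array([(y == k).mean() for k in classes])    # n_k / n
    # Pooled within-class sum of squares divided by n - K.
    ss = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, mu_hat))
    sigma2_hat = ss / (n - K)
    return classes, mu_hat, pi_hat, sigma2_hat


def lda_predict(x_new, classes, mu_hat, pi_hat, sigma2_hat):
    """Assign each value in x_new to the class maximizing (4.17)."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Rows index observations, columns index classes.
    scores = (
        x_new[:, None] * mu_hat / sigma2_hat
        - mu_hat**2 / (2.0 * sigma2_hat)
        + np.log(pi_hat)
    )
    return classes[np.argmax(scores, axis=1)]


# Simulated training data: 20 observations per class (an assumption).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.25, 1.0, 20), rng.normal(1.25, 1.0, 20)])
y = np.repeat([1, 2], 20)

classes, mu_hat, pi_hat, sigma2_hat = lda_fit(x, y)
# With equal estimated priors, the boundary is the midpoint of the class means.
lda_boundary = mu_hat.mean()

print(mu_hat, pi_hat, sigma2_hat, lda_boundary)
```

Because the two classes have the same number of observations here, the log prior terms cancel and the decision boundary reduces to the midpoint of the two sample means.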

The right-hand panel of Figure 4.4 displays a histogram of a random
sample of 20 observations from each class. To implement LDA, we began
by estimating \(\pi_k\), \(\mu_k\), and \(\sigma^2\) using (4.15) and (4.16). We then computed the
decision boundary, shown as a black solid line, that results from assigning
an observation to the class for which (4.17) is largest. All points to the left
of this line will be assigned to the green class, while points to the right of
this line are assigned to the purple class. In this case, since \(n_1 = n_2 = 20\),
we have \(\hat{\pi}_1 = \hat{\pi}_2\). As a result, the decision boundary corresponds to the
midpoint between the sample means for the two classes, \((\hat{\mu}_1 + \hat{\mu}_2)/2\). The
figure indicates that the LDA decision boundary is slightly to the left of
the optimal Bayes decision boundary, which instead equals \((\mu_1 + \mu_2)/2 = 0\).
How well does the LDA classifier perform on this data? Since this is
simulated data, we can generate a large number of test observations in order
to compute the Bayes error rate and the LDA test error rate. These are
10.6% and 11.1%, respectively. In other words, the LDA classifier’s error
rate is only 0.5 percentage points above the smallest possible error rate!
This indicates that LDA is performing pretty well on this data set.

FIGURE 4.5. Two multivariate Gaussian density functions are shown, with
p = 2. Left: The two predictors are uncorrelated. Right: The two variables have 
a correlation of 0.7. 


To reiterate, the LDA classifier results from assuming that the
observations within each class come from a normal distribution with a class-specific
mean and a common variance \(\sigma^2\), and plugging estimates for these
parameters into the Bayes classifier. In Section 4.4.4, we will consider a less
stringent set of assumptions, by allowing the observations in the \(k\)th class
to have a class-specific variance, \(\sigma_k^2\).


4.4.3 Linear Discriminant Analysis for p > 1

We now extend the LDA classifier to the case of multiple predictors. To
do this, we will assume that \(X = (X_1, X_2, \ldots, X_p)\) is drawn from a
multivariate Gaussian (or multivariate normal) distribution, with a class-specific
mean vector and a common covariance matrix. We begin with a brief review
of such a distribution.

FIGURE 4.6. An example with three classes. The observations from each class
are drawn from a multivariate Gaussian distribution with p = 2, with a
class-specific mean vector and a common covariance matrix. Left: Ellipses that
contain 95% of the probability for each of the three classes are shown. The
dashed lines are the Bayes decision boundaries. Right: 20 observations were
generated from each class, and the corresponding LDA decision boundaries are
indicated using solid black lines. The Bayes decision boundaries are once again
shown as dashed lines.

The multivariate Gaussian distribution assumes that each individual
predictor follows a one-dimensional normal distribution, as in (4.11), with some
correlation between each pair of predictors. Two examples of multivariate
Gaussian distributions with p = 2 are shown in Figure 4.5. The height of
the surface at any particular point represents the probability that both $X_1$
and $X_2$ fall in a small region around that point. In either panel, if the
surface is cut along the $X_1$ axis or along the $X_2$ axis, the resulting cross-section
will have the shape of a one-dimensional normal distribution. The left-hand
panel of Figure 4.5 illustrates an example in which $\mathrm{Var}(X_1) = \mathrm{Var}(X_2)$ and
$\mathrm{Cor}(X_1, X_2) = 0$; this surface has a characteristic bell shape. However, the
bell shape will be distorted if the predictors are correlated or have unequal
variances, as is illustrated in the right-hand panel of Figure 4.5. In this
situation, the base of the bell will have an elliptical, rather than circular,
shape. To indicate that a p-dimensional random variable $X$ has a
multivariate Gaussian distribution, we write $X \sim N(\mu, \Sigma)$. Here $E(X) = \mu$ is
the mean of $X$ (a vector with p components), and $\mathrm{Cov}(X) = \Sigma$ is the
p × p covariance matrix of $X$. Formally, the multivariate Gaussian density
is defined as


$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \tag{4.18}$$
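
The density in (4.18) can be evaluated directly with a few lines of base R. The following is a minimal sketch; the function name `dmvnorm_sketch` and the two covariance matrices (unit variances, with correlation 0 or 0.7) are illustrative assumptions rather than values taken from Figure 4.5.

```r
# Multivariate Gaussian density from (4.18), written in base R.
# x and mu are numeric vectors of length p; Sigma is a p x p covariance matrix.
# The name dmvnorm_sketch is hypothetical and used only for this illustration.
dmvnorm_sketch <- function(x, mu, Sigma) {
  p <- length(mu)
  d <- x - mu
  quad <- as.numeric(t(d) %*% solve(Sigma, d))   # (x - mu)^T Sigma^{-1} (x - mu)
  exp(-0.5 * quad) / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
}

# Uncorrelated predictors with unit variances (assumed values)
Sigma_left <- diag(2)
# Correlation 0.7 with unit variances (assumed values)
Sigma_right <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)

dens_left  <- dmvnorm_sketch(c(0, 0), mu = c(0, 0), Sigma = Sigma_left)
dens_right <- dmvnorm_sketch(c(0, 0), mu = c(0, 0), Sigma = Sigma_right)
```

With uncorrelated predictors the joint density factors into a product of one-dimensional normal densities, so `dens_left` should agree with `dnorm(0)^2`.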

In the case of p > 1 predictors, the LDA classifier assumes that the
observations in the kth class are drawn from a multivariate Gaussian
distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector, and $\Sigma$ is a
covariance matrix that is common to all K classes. Plugging the density
function for the kth class, $f_k(X = x)$, into (4.10) and performing a little
bit of algebra reveals that the Bayes classifier assigns an observation X = x
to the class for which

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \tag{4.19}$$

is largest. This is the vector/matrix version of (4.13). 
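
As a rough illustration of (4.19), the sketch below computes the discriminant score for each class when the parameters are treated as known, and assigns a point to the class with the largest score. The function name `lda_score`, the three mean vectors, the covariance matrix, and the test point are all made-up values for illustration.

```r
# Linear discriminant score (4.19) for one class, with known parameters.
# lda_score is a hypothetical helper name, not a library function.
lda_score <- function(x, mu_k, Sigma, pi_k) {
  Sinv_mu <- solve(Sigma, mu_k)                     # Sigma^{-1} mu_k
  sum(x * Sinv_mu) - 0.5 * sum(mu_k * Sinv_mu) + log(pi_k)
}

# Hypothetical three-class setup with p = 2 (values chosen for illustration only)
mus    <- list(c(-2, 0), c(2, 0), c(0, 3))
Sigma  <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
priors <- c(1/3, 1/3, 1/3)

x_new  <- c(1.5, 0.2)
scores <- sapply(1:3, function(k) lda_score(x_new, mus[[k]], Sigma, priors[k]))
predicted_class <- which.max(scores)   # class with the largest discriminant score
```

Because the priors are equal here, the `log(pi_k)` term shifts every score by the same amount and does not affect which class wins.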

An example is shown in the left-hand panel of Figure 4.6. Three equally-sized
Gaussian classes are shown with class-specific mean vectors and a
common covariance matrix. The three ellipses represent regions that
contain 95% of the probability for each of the three classes. The dashed lines
are the Bayes decision boundaries. In other words, they represent the set
of values $x$ for which $\delta_k(x) = \delta_l(x)$; i.e.

$$x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k = x^T \Sigma^{-1} \mu_l - \frac{1}{2} \mu_l^T \Sigma^{-1} \mu_l \tag{4.20}$$

for $k \neq l$. (The $\log \pi_k$ term from (4.19) has disappeared because each of
the three classes has the same number of training observations; i.e. $\pi_k$ is
the same for each class.) Note that there are three lines representing the
Bayes decision boundaries because there are three pairs of classes among 
the three classes. That is, one Bayes decision boundary separates class 1 
from class 2, one separates class 1 from class 3, and one separates class 2 
from class 3. These three Bayes decision boundaries divide the predictor 
space into three regions. The Bayes classifier will classify an observation 
according to the region in which it is located. 

Once again, we need to estimate the unknown parameters $\mu_1, \ldots, \mu_K$,
$\pi_1, \ldots, \pi_K$, and $\Sigma$; the formulas are similar to those used in the
one-dimensional case, given in (4.15). To assign a new observation X = x,
LDA plugs these estimates into (4.19) and classifies to the class for which
$\hat{\delta}_k(x)$ is largest. Note that in (4.19) $\delta_k(x)$ is a linear function of $x$; that is,
the LDA decision rule depends on $x$ only through a linear combination of
its elements. Once again, this is the reason for the word linear in LDA.
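
The plug-in step described above can be sketched in base R. The text does not spell out the multivariate formulas, so the version below assumes the natural analogues of (4.15): class proportions for the priors, class sample means, and a pooled within-class covariance with divisor n − K. The simulated class means are illustrative assumptions.

```r
set.seed(1)
# Simulated training data: K = 3 classes, p = 2, 20 observations per class.
# The class means below are illustrative assumptions, not values from the text.
n_k   <- 20
means <- rbind(c(-2, 0), c(2, 0), c(0, 3))
X <- do.call(rbind, lapply(1:3, function(k) {
  cbind(rnorm(n_k, means[k, 1]), rnorm(n_k, means[k, 2]))
}))
y <- rep(1:3, each = n_k)

K <- length(unique(y)); n <- nrow(X)
pi_hat <- as.numeric(table(y)) / n                                        # class proportions
mu_hat <- t(sapply(1:K, function(k) colMeans(X[y == k, , drop = FALSE]))) # K x p matrix

# Pooled within-class covariance with divisor n - K (assumed analogue of (4.15))
Sigma_hat <- Reduce(`+`, lapply(1:K, function(k) {
  Xc <- sweep(X[y == k, , drop = FALSE], 2, mu_hat[k, ])   # center class k
  crossprod(Xc)
})) / (n - K)

# Plug the estimates into (4.19) and classify each training observation
Sinv_mu <- solve(Sigma_hat, t(mu_hat))                        # p x K
const   <- -0.5 * colSums(t(mu_hat) * Sinv_mu) + log(pi_hat)  # length K
scores  <- sweep(X %*% Sinv_mu, 2, const, "+")                # n x K score matrix
pred    <- max.col(scores, ties.method = "first")
train_error <- mean(pred != y)
```

Each column of `scores` is linear in the rows of `X`, which is the linearity the text refers to.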

In the right-hand panel of Figure 4.6, 20 observations drawn from each of 
the three classes are displayed, and the resulting LDA decision boundaries 
are shown as solid black lines. Overall, the LDA decision boundaries are 
pretty close to the Bayes decision boundaries, shown again as dashed lines. 
The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770, 
respectively. This indicates that LDA is performing well on this data. 

We can perform LDA on the Default data in order to predict whether 
or not an individual will default on the basis of credit card balance and 
student status. The LDA model fit to the 10,000 training samples results
in a training error rate of 2.75%. This sounds like a low error rate, but two
caveats must be noted. 

• First of all, training error rates will usually be lower than test error 
rates, which are the real quantity of interest. In other words, we 
might expect this classifier to perform worse if we use it to predict 
whether or not a new set of individuals will default. The reason is 
that we specifically adjust the parameters of our model to do well on 
the training data. The higher the ratio of parameters p to number 
of samples n, the more we expect this overfitting to play a role. For 
these data we don’t expect this to be a problem, since p = 3 and 
n = 10,000. 

• Second, since only 3.33% of the individuals in the training sample
defaulted, a simple but useless classifier that always predicts that
each individual will not default, regardless of his or her credit card
balance and student status, will result in an error rate of 3.33%. In
other words, the trivial null classifier will achieve an error rate that
is only a bit higher than the LDA training set error rate.

                              True default status
                              No        Yes      Total
Predicted         No       9,644        252      9,896
default status    Yes         23         81        104
                  Total    9,667        333     10,000

TABLE 4.4. A confusion matrix compares the LDA predictions to the true
default statuses for the 10,000 training observations in the Default data set.
Elements on the diagonal of the matrix represent individuals whose default statuses
were correctly predicted, while off-diagonal elements represent individuals that
were misclassified. LDA made incorrect predictions for 23 individuals who did
not default and for 252 individuals who did default.

In practice, a binary classifier such as this one can make two types of 
errors: it can incorrectly assign an individual who defaults to the no default 
category, or it can incorrectly assign an individual who does not default to 
the default category. It is often of interest to determine which of these two 
types of errors are being made. A confusion matrix, shown for the Default
data in Table 4.4, is a convenient way to display this information. The 
table reveals that LDA predicted that a total of 104 people would default. 
Of these people, 81 actually defaulted and 23 did not. Hence only 23 out 
of 9,667 of the individuals who did not default were incorrectly labeled. 
This looks like a pretty low error rate! However, of the 333 individuals who 
defaulted, 252 (or 75.7%) were missed by LDA. So while the overall error 
rate is low, the error rate among individuals who defaulted is very high. 
From the perspective of a credit card company that is trying to identify 
high-risk individuals, an error rate of 252/333 = 75.7% among individuals 
who default may well be unacceptable. 

Class-specific performance is also important in medicine and biology, 
where the terms sensitivity and specificity characterize the performance of 
a classifier or screening test. Here the sensitivity is the percentage of
true defaulters that are identified, a low 81/333 = 24.3%. The specificity
is the percentage of non-defaulters that are correctly identified, here
(1 − 23/9,667) × 100 = 99.8%.
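
The arithmetic behind these figures can be reproduced from the counts in Table 4.4. This is a small sketch; the object names are illustrative.

```r
# Counts from Table 4.4 (rows: predicted status, columns: true status)
conf <- matrix(c(9644, 23, 252, 81), nrow = 2,
               dimnames = list(predicted = c("No", "Yes"), truth = c("No", "Yes")))

# Off-diagonal cells are the misclassified individuals
overall_error <- (conf["Yes", "No"] + conf["No", "Yes"]) / sum(conf)
sensitivity   <- conf["Yes", "Yes"] / sum(conf[, "Yes"])   # defaulters correctly identified
specificity   <- conf["No", "No"]   / sum(conf[, "No"])    # non-defaulters correctly identified

round(100 * c(overall_error = overall_error,
              sensitivity   = sensitivity,
              specificity   = specificity), 1)
```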

Why does LDA do such a poor job of classifying the customers who
default? In other words, why does it have such a low sensitivity? As we have
seen, LDA is trying to approximate the Bayes classifier, which has the
lowest total error rate out of all classifiers (if the Gaussian model is correct).
That is, the Bayes classifier will yield the smallest possible total number
of misclassified observations, irrespective of which class the errors come
from. Some misclassifications will result from incorrectly assigning
a customer who does not default to the default class, and others will
result from incorrectly assigning a customer who defaults to the non-default
class. In contrast, a credit card company might particularly wish to avoid
incorrectly classifying an individual who will default, whereas incorrectly
classifying an individual who will not default, though still to be avoided,
is less problematic. We will now see that it is possible to modify LDA in
order to develop a classifier that better meets the credit card company’s
needs.

                              True default status
                              No        Yes      Total
Predicted         No       9,432        138      9,570
default status    Yes        235        195        430
                  Total    9,667        333     10,000

TABLE 4.5. A confusion matrix compares the LDA predictions to the true
default statuses for the 10,000 training observations in the Default data set, using
a modified threshold value that predicts default for any individuals whose posterior
default probability exceeds 20%.

The Bayes classifier works by assigning an observation to the class for
which the posterior probability $p_k(X)$ is greatest. In the two-class case, this
amounts to assigning an observation to the default class if

Pr(default = Yes|X = x) > 0.5. (4.21)

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50%
for the posterior probability of default in order to assign an observation
to the default class. However, if we are concerned about incorrectly
predicting the default status for individuals who default, then we can consider
lowering this threshold. For instance, we might assign any customer with a
posterior probability of default above 20% to the default class. In other
words, instead of assigning an observation to the default class if (4.21)
holds, we could instead assign an observation to this class if

Pr(default = Yes|X = x) > 0.2. (4.22)

The error rates that result from taking this approach are shown in Table 4.5. 
Now LDA predicts that 430 individuals will default. Of the 333 individuals 
who default, LDA correctly predicts all but 138, or 41.4%. This is a vast 
improvement over the error rate of 75.7% that resulted from using the 
threshold of 50%. However, this improvement comes at a cost: now 235 
individuals who do not default are incorrectly classified. As a result, the 
overall error rate has increased slightly to 3.73%. But a credit card company
may consider this slight increase in the total error rate to be a small price to 
pay for more accurate identification of individuals who do indeed default. 
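
The effect of moving the threshold can be sketched as follows. The helper `classify_default` and the five posterior probabilities are hypothetical, made up for illustration; the error rates at the end use only the counts reported in Table 4.5.

```r
# Assign the "Yes" class whenever the posterior probability exceeds a threshold.
# classify_default is a hypothetical helper, not part of any library.
classify_default <- function(posterior_yes, threshold = 0.5) {
  ifelse(posterior_yes > threshold, "Yes", "No")
}

# Made-up posterior probabilities of default for five customers
posterior_yes <- c(0.03, 0.18, 0.25, 0.47, 0.81)
pred_50 <- classify_default(posterior_yes)                   # rule (4.21)
pred_20 <- classify_default(posterior_yes, threshold = 0.2)  # rule (4.22)

# Error rates implied by the counts in Table 4.5
false_neg <- 138; false_pos <- 235; n_default <- 333; n_total <- 10000
overall_error_20   <- (false_neg + false_pos) / n_total   # overall error rate
defaulter_error_20 <- false_neg / n_default               # error rate among defaulters
```

Lowering the threshold can only move customers from the "No" prediction to the "Yes" prediction, which is why the error among defaulters falls while the error among non-defaulters rises.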

Figure 4.7 illustrates the trade-off that results from modifying the
threshold value for the posterior probability of default. Various error rates are
shown as a function of the threshold value. Using a threshold of 0.5, as in
(4.21), minimizes the overall error rate, shown as a black solid line. This
is to be expected, since the Bayes classifier uses a threshold of 0.5 and is
known to have the lowest overall error rate. But when a threshold of 0.5 is
used, the error rate among the individuals who default is quite high (blue
dashed line). As the threshold is reduced, the error rate among individuals
who default decreases steadily, but the error rate among the individuals
who do not default increases. How can we decide which threshold value is
best? Such a decision must be based on domain knowledge, such as detailed
information about the costs associated with default.

FIGURE 4.7. For the Default data set, error rates are shown as a function of
the threshold value for the posterior probability that is used to perform the
assignment. The black solid line displays the overall error rate. The blue dashed line
represents the fraction of defaulting customers that are incorrectly classified, and
the orange dotted line indicates the fraction of errors among the non-defaulting
customers.

The ROC curve is a popular graphic for simultaneously displaying the
two types of errors for all possible thresholds. The name “ROC” is
historic, and comes from communications theory. It is an acronym for receiver
operating characteristics. Figure 4.8 displays the ROC curve for the LDA
classifier on the training data. The overall performance of a classifier,
summarized over all possible thresholds, is given by the area under the (ROC)
curve (AUC). An ideal ROC curve will hug the top left corner, so the larger
the AUC the better the classifier. For this data the AUC is 0.95, which is
close to the maximum of one and so would be considered very good. We expect
a classifier that performs no better than chance to have an AUC of 0.5
(when evaluated on an independent test set not used in model training).
ROC curves are useful for comparing different classifiers, since they take
into account all possible thresholds. It turns out that the ROC curve for the
logistic regression model of Section 4.3.4 fit to these data is virtually
indistinguishable from this one for the LDA model, so we do not display it here.
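
To make the construction of the ROC curve concrete, here is a minimal base R sketch that sweeps a threshold over a set of scores and accumulates the area with the trapezoidal rule. The scores are simulated (an assumption for illustration, unrelated to the Default data), and no ROC package is used.

```r
set.seed(2)
# Simulated scores: the positive class tends to score higher (illustrative data only)
truth <- rep(c(0, 1), times = c(200, 50))
score <- c(rnorm(200, mean = 0), rnorm(50, mean = 2))

# Sweep the threshold over every observed score, from strictest to most lenient
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr <- sapply(thresholds, function(th) mean(score[truth == 1] >= th))  # sensitivity
fpr <- sapply(thresholds, function(th) mean(score[truth == 0] >= th))  # 1 - specificity

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# plot(fpr, tpr, type = "l"); abline(0, 1, lty = 3)   # uncomment to draw the curve
```

When there are no tied scores, this area equals the proportion of (positive, negative) pairs in which the positive observation receives the higher score, which is one common way to interpret the AUC.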

As we have seen above, varying the classifier threshold changes its true
positive and false positive rate. These are also called the sensitivity and one
minus the specificity of our classifier. Since there is an almost bewildering
array of terms used in this context, we now give a summary. Table 4.6
shows the possible results when applying a classifier (or diagnostic test)
to a population. To make the connection with the epidemiology literature,
we think of “+” as the “disease” that we are trying to detect, and “−” as
the “non-disease” state. To make the connection to the classical hypothesis
testing literature, we think of “−” as the null hypothesis and “+” as the
alternative (non-null) hypothesis. In the context of the Default data, “+”
indicates an individual who defaults, and “−” indicates one who does not.

FIGURE 4.8. A ROC curve for the LDA classifier on the Default data. It
traces out two types of error as we vary the threshold value for the posterior
probability of default. The actual thresholds are not shown. The true positive rate
is the sensitivity: the fraction of defaulters that are correctly identified, using
a given threshold value. The false positive rate is 1 − specificity: the fraction of
non-defaulters that we classify incorrectly as defaulters, using that same threshold
value. The ideal ROC curve hugs the top left corner, indicating a high true positive
rate and a low false positive rate. The dotted line represents the “no information”
classifier; this is what we would expect if student status and credit card balance
are not associated with probability of default.

                                 Predicted class
                         − or Null          + or Non-null      Total
True     − or Null       True Neg. (TN)     False Pos. (FP)    N
class    + or Non-null   False Neg. (FN)    True Pos. (TP)     P
         Total           N*                 P*

TABLE 4.6. Possible results when applying a classifier or diagnostic test to a
population.
Name                Definition    Synonyms
False Pos. rate     FP/N          Type I error, 1 − Specificity
True Pos. rate      TP/P          1 − Type II error, power, sensitivity, recall
Pos. Pred. value    TP/P*         Precision, 1 − false discovery proportion
Neg. Pred. value    TN/N*

TABLE 4.7. Important measures for classification and diagnostic testing,
derived from quantities in Table 4.6.

Table 4.7 lists many of the popular performance measures that are used in 
this context. The denominators for the false positive and true positive rates 
are the actual population counts in each class. In contrast, the denominators 
for the positive predictive value and the negative predictive value are the 
total predicted counts for each class. 
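
The definitions in Table 4.7 translate directly into code. In the sketch below the function name `classification_measures` is hypothetical, and the counts are those of Table 4.5 (the 20% threshold), with “+” taken to mean default.

```r
# Measures from Table 4.7, computed from the four cells of Table 4.6.
# classification_measures is a hypothetical helper name.
classification_measures <- function(TN, FP, FN, TP) {
  N <- TN + FP; P <- FN + TP              # true class totals
  N_star <- TN + FN; P_star <- FP + TP    # predicted class totals
  c(false_pos_rate = FP / N,
    true_pos_rate  = TP / P,
    pos_pred_value = TP / P_star,
    neg_pred_value = TN / N_star)
}

# Counts from Table 4.5
m <- classification_measures(TN = 9432, FP = 235, FN = 138, TP = 195)
round(m, 4)
```

Note how the first two measures divide by the true class totals, while the last two divide by the predicted class totals, as described above.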

4.4.4 Quadratic Discriminant Analysis

As we have discussed, LDA assumes that the observations within each
class are drawn from a multivariate Gaussian distribution with a
class-specific mean vector and a covariance matrix that is common to all K
classes. Quadratic discriminant analysis (QDA) provides an alternative
approach. Like LDA, the QDA classifier results from assuming that the
observations from each class are drawn from a Gaussian distribution, and
plugging estimates for the parameters into Bayes’ theorem in order to
perform prediction. However, unlike LDA, QDA assumes that each class has
its own covariance matrix. That is, it assumes that an observation from the
kth class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is a covariance matrix
for the kth class. Under this assumption, the Bayes classifier assigns an
observation X = x to the class for which

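$$\begin{aligned}
\delta_k(x) &= -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k \\
&= -\frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log|\Sigma_k| + \log \pi_k
\end{aligned} \tag{4.23}$$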

is largest. So the QDA classifier involves plugging estimates for $\Sigma_k$, $\mu_k$,
and $\pi_k$ into (4.23), and then assigning an observation X = x to the class
for which this quantity is largest. Unlike in (4.19), the quantity $x$ appears
as a quadratic function in (4.23). This is where QDA gets its name.
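
A minimal sketch of the score in (4.23), using the first (unexpanded) form and treating the parameters as known. The helper name `qda_score` and all numerical values are assumptions chosen for illustration.

```r
# Quadratic discriminant score (4.23) for one class, with known parameters.
# qda_score is a hypothetical helper name.
qda_score <- function(x, mu_k, Sigma_k, pi_k) {
  d <- x - mu_k
  log_det <- as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus)  # log|Sigma_k|
  -0.5 * sum(d * solve(Sigma_k, d)) - 0.5 * log_det + log(pi_k)
}

# Two hypothetical classes whose predictors have correlation 0.7 and -0.7
mu     <- list(c(-1, 0), c(1, 0))
Sigma  <- list(matrix(c(1, 0.7, 0.7, 1), 2), matrix(c(1, -0.7, -0.7, 1), 2))
priors <- c(0.5, 0.5)

x_new  <- c(0.5, 1)
scores <- sapply(1:2, function(k) qda_score(x_new, mu[[k]], Sigma[[k]], priors[k]))
predicted_class <- which.max(scores)
```

Because each class has its own covariance matrix, the quadratic term in `x` no longer cancels when two scores are compared, so the resulting boundary is curved.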

Why does it matter whether or not we assume that the K classes share a
common covariance matrix? In other words, why would one prefer LDA to
QDA, or vice-versa? The answer lies in the bias-variance trade-off. When
there are p predictors, then estimating a covariance matrix requires
estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix
for each class, for a total of Kp(p+1)/2 parameters. With 50 predictors this
is some multiple of 50 × 51/2 = 1,275, which is a lot of parameters. By instead
assuming that the K classes share a common covariance matrix, the LDA model
becomes linear in x, which means there are Kp linear coefficients to
estimate. Consequently, LDA is a much less flexible classifier than QDA, and
so has substantially lower variance. This can potentially lead to improved
prediction performance. But there is a trade-off: if LDA’s assumption that
the K classes share a common covariance matrix is badly off, then LDA
can suffer from high bias. Roughly speaking, LDA tends to be a better bet
than QDA if there are relatively few training observations and so reducing
variance is crucial. In contrast, QDA is recommended if the training set is
very large, so that the variance of the classifier is not a major concern, or if
the assumption of a common covariance matrix for the K classes is clearly
untenable.

FIGURE 4.9. Left: The Bayes (purple dashed), LDA (black dotted), and QDA
(green solid) decision boundaries for a two-class problem with $\Sigma_1 = \Sigma_2$. The
shading indicates the QDA decision rule. Since the Bayes decision boundary is
linear, it is more accurately approximated by LDA than by QDA. Right: Details
are as given in the left-hand panel, except that $\Sigma_1 \neq \Sigma_2$. Since the Bayes decision
boundary is non-linear, it is more accurately approximated by QDA than by LDA.
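
The parameter counts in this argument are easy to tabulate. The helper names below are illustrative, and K = 3 is an arbitrary choice; the formula p(p+1)/2 with p = 50 evaluates to 1,275.

```r
# Number of distinct entries in a symmetric p x p covariance matrix
cov_params <- function(p) p * (p + 1) / 2
# QDA: one covariance matrix per class
qda_cov_params <- function(p, K) K * cov_params(p)
# LDA: the count of linear coefficients quoted in the text
lda_linear_coefs <- function(p, K) K * p

cov_params(50)                 # 1275
qda_cov_params(50, K = 3)      # 3825
lda_linear_coefs(50, K = 3)    # 150
```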

Figure 4.9 illustrates the performances of LDA and QDA in two scenarios.
In the left-hand panel, the two Gaussian classes have a common
correlation of 0.7 between $X_1$ and $X_2$. As a result, the Bayes decision boundary
is linear and is accurately approximated by the LDA decision boundary.
The QDA decision boundary is inferior, because it suffers from higher
variance without a corresponding decrease in bias. In contrast, the right-hand
panel displays a situation in which the orange class has a correlation of 0.7
between the variables and the blue class has a correlation of −0.7. Now
the Bayes decision boundary is quadratic, and so QDA more accurately
approximates this boundary than does LDA.
In this chapter, we have considered three different classification approaches:
logistic regression, LDA, and QDA. In Chapter 2, we also discussed the 
K-nearest neighbors (KNN) method. We now consider the types of 
scenarios in which one approach might dominate the others. 

Though their motivations differ, the logistic regression and LDA methods
are closely connected. Consider the two-class setting with p = 1 predictor,
and let $p_1(x)$ and $p_2(x) = 1 - p_1(x)$ be the probabilities that the observation
X = x belongs to class 1 and class 2, respectively. In the LDA framework,
we can see from (4.12) to (4.13) (and a bit of simple algebra) that the log
odds is given by

$$\log\left(\frac{p_1(x)}{1 - p_1(x)}\right) = c_0 + c_1 x, \tag{4.24}$$

where $c_0$ and $c_1$ are functions of $\mu_1$, $\mu_2$, and $\sigma^2$. From (4.4), we know that
in logistic regression,

$$\log\left(\frac{p_1}{1 - p_1}\right) = \beta_0 + \beta_1 x. \tag{4.25}$$

Both (4.24) and (4.25) are linear functions of x. Hence, both logistic
regression and LDA produce linear decision boundaries. The only difference
between the two approaches lies in the fact that $\beta_0$ and $\beta_1$ are estimated
using maximum likelihood, whereas $c_0$ and $c_1$ are computed using the
estimated mean and variance from a normal distribution. This same connection
between LDA and logistic regression also holds for multidimensional data
with p > 1.
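
The linearity claimed in (4.24) can be checked numerically. The sketch below assumes a one-dimensional LDA model with made-up parameter values, computes the posterior log odds from the Gaussian densities, and compares it with the line obtained by differencing the two discriminant functions. In this derivation the intercept also picks up the log ratio of the prior probabilities.

```r
# One-dimensional LDA model: class k has density N(mu_k, sigma^2) and prior pi_k.
# All parameter values are illustrative assumptions.
mu1 <- -1; mu2 <- 1.5; sigma <- 1.2; pi1 <- 0.3; pi2 <- 0.7

# Posterior probability of class 1 via Bayes' theorem
posterior1 <- function(x) {
  num <- pi1 * dnorm(x, mu1, sigma)
  num / (num + pi2 * dnorm(x, mu2, sigma))
}

# Coefficients implied by differencing the discriminant functions of the two classes
c1 <- (mu1 - mu2) / sigma^2
c0 <- log(pi1 / pi2) - (mu1^2 - mu2^2) / (2 * sigma^2)

x <- seq(-3, 3, by = 0.5)
log_odds <- log(posterior1(x) / (1 - posterior1(x)))
max_gap  <- max(abs(log_odds - (c0 + c1 * x)))   # should be numerically zero
```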

Since logistic regression and LDA differ only in their fitting procedures, 
one might expect the two approaches to give similar results. This is often, 
but not always, the case. LDA assumes that the observations are drawn 
from a Gaussian distribution with a common covariance matrix in each 
class, and so can provide some improvements over logistic regression when 
this assumption approximately holds. Conversely, logistic regression can 
outperform LDA if these Gaussian assumptions are not met. 

Recall from Chapter 2 that KNN takes a completely different approach 
from the classifiers seen in this chapter. In order to make a prediction for 
an observation X = x, the K training observations that are closest to x are 
identified. Then X is assigned to the class to which the plurality of these 
observations belong. Hence KNN is a completely non-parametric approach: 
no assumptions are made about the shape of the decision boundary.
Therefore, we can expect this approach to dominate LDA and logistic regression
when the decision boundary is highly non-linear. On the other hand, KNN 
does not tell us which predictors are important; we don’t get a table of 
coefficients as in Table 4.3. 
FIGURE 4.10. Boxplots of the test error rates for each of the linear scenarios
described in the main text. Panels: SCENARIO 1, SCENARIO 2, SCENARIO 3.
Methods on the horizontal axis: KNN-1, KNN-CV, LDA, Logistic, QDA.

FIGURE 4.11. Boxplots of the test error rates for each of the non-linear
scenarios described in the main text. Panels: SCENARIO 4, SCENARIO 5,
SCENARIO 6.


Finally, QDA serves as a compromise between the non-parametric KNN 
method and the linear LDA and logistic regression approaches. Since QDA 
assumes a quadratic decision boundary, it can accurately model a wider 
range of problems than can the linear methods. Though not as flexible 
as KNN, QDA can perform better in the presence of a limited number of 
training observations because it does make some assumptions about the 
form of the decision boundary. 

To illustrate the performances of these four classification approaches, 
we generated data from six different scenarios. In three of the scenarios, 
the Bayes decision boundary is linear, and in the remaining scenarios it 
is non-linear. For each scenario, we produced 100 random training data 
sets. On each of these training sets, we fit each method to the data and 
computed the resulting test error rate on a large test set. Results for the 
linear scenarios are shown in Figure 4.10, and the results for the non-linear 
scenarios are in Figure 4.11. The KNN method requires selection of K, the
number of neighbors. We performed KNN with two values of K: K = 1,
and a value of K that was chosen automatically using an approach called
cross-validation, which we discuss further in Chapter 5.

In each of the six scenarios, there were p = 2 predictors. The scenarios 
were as follows: 

Scenario 1: There were 20 training observations in each of two classes.
The observations within each class were uncorrelated random normal
variables with a different mean in each class. The left-hand panel
of Figure 4.10 shows that LDA performed well in this setting, as
one would expect since this is the model assumed by LDA. KNN
performed poorly because it paid a price in terms of variance that
was not offset by a reduction in bias. QDA also performed worse
than LDA, since it fit a more flexible classifier than necessary. Since
logistic regression assumes a linear decision boundary, its results were
only slightly inferior to those of LDA.

Scenario 2: Details are as in Scenario 1, except that within each
class, the two predictors had a correlation of −0.5. The center panel
of Figure 4.10 indicates little change in the relative performances of
the methods as compared to the previous scenario.

Scenario 3: We generated $X_1$ and $X_2$ from the t-distribution, with
50 observations per class. The t-distribution has a similar shape to
the normal distribution, but it has a tendency to yield more extreme
points, that is, more points that are far from the mean. In this
setting, the decision boundary was still linear, and so fit into the logistic
regression framework. The set-up violated the assumptions of LDA,
since the observations were not drawn from a normal distribution.
The right-hand panel of Figure 4.10 shows that logistic regression
outperformed LDA, though both methods were superior to the other
approaches. In particular, the QDA results deteriorated considerably
as a consequence of non-normality.

Scenario 4: The data were generated from a normal distribution,
with a correlation of 0.5 between the predictors in the first class,
and correlation of −0.5 between the predictors in the second class.
This setup corresponded to the QDA assumption, and resulted in
quadratic decision boundaries. The left-hand panel of Figure 4.11
shows that QDA outperformed all of the other approaches.

Scenario 5: Within each class, the observations were generated from
a normal distribution with uncorrelated predictors. However, the
responses were sampled from the logistic function using $X_1^2$, $X_2^2$, and
$X_1 \times X_2$ as predictors. Consequently, there is a quadratic decision
boundary. The center panel of Figure 4.11 indicates that QDA once
again performed best, followed closely by KNN-CV. The linear
methods had poor performance.
Scenario 6: Details are as in the previous scenario, but the responses
were sampled from a more complicated non-linear function. As a
result, even the quadratic decision boundaries of QDA could not
adequately model the data. The right-hand panel of Figure 4.11 shows
that QDA gave slightly better results than the linear methods, while
the much more flexible KNN-CV method gave the best results. But
KNN with K = 1 gave the worst results out of all methods. This
highlights the fact that even when the data exhibits a complex
non-linear relationship, a non-parametric method such as KNN can still
give poor results if the level of smoothness is not chosen correctly.
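
The simulation design described above can be sketched for a setting like Scenario 1 using base R only. This is not the authors' code: the class means (0, 0) and (1, 1), the test set size, and the helper names are assumptions, LDA is fitted by hand with a pooled covariance estimate, and only LDA, logistic regression, and KNN with K = 1 are compared.

```r
set.seed(3)
# Scenario-1-style data: two classes, p = 2, uncorrelated normal predictors.
# The class means (0,0) and (1,1) are assumptions; the text does not state them.
make_data <- function(n_per_class) {
  X <- rbind(matrix(rnorm(2 * n_per_class, mean = 0), ncol = 2),
             matrix(rnorm(2 * n_per_class, mean = 1), ncol = 2))
  data.frame(x1 = X[, 1], x2 = X[, 2], y = rep(c(0, 1), each = n_per_class))
}

# Two-class LDA fitted by hand with a pooled covariance estimate
lda_predict <- function(train, test) {
  X0 <- as.matrix(train[train$y == 0, c("x1", "x2")])
  X1 <- as.matrix(train[train$y == 1, c("x1", "x2")])
  m0 <- colMeans(X0); m1 <- colMeans(X1)
  S  <- (crossprod(sweep(X0, 2, m0)) + crossprod(sweep(X1, 2, m1))) / (nrow(train) - 2)
  w  <- solve(S, m1 - m0)                                   # direction of the boundary
  b  <- -0.5 * sum((m1 + m0) * w) + log(nrow(X1) / nrow(X0))
  as.numeric(as.matrix(test[, c("x1", "x2")]) %*% w + b > 0)
}

# 1-nearest-neighbor prediction using Euclidean distance
knn1_predict <- function(train, test) {
  Xtr <- as.matrix(train[, c("x1", "x2")])
  Xte <- as.matrix(test[, c("x1", "x2")])
  apply(Xte, 1, function(z) train$y[which.min(colSums((t(Xtr) - z)^2))])
}

test <- make_data(1000)   # one large test set, reused for every training set
errors <- replicate(100, {
  train <- make_data(20)  # 20 training observations per class
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  c(LDA      = mean(lda_predict(train, test) != test$y),
    Logistic = mean((predict(fit, test, type = "response") > 0.5) != test$y),
    KNN1     = mean(knn1_predict(train, test) != test$y))
})
mean_errors <- rowMeans(errors)   # average test error per method over 100 training sets
```

Calling `boxplot(t(errors))` would give a display in the spirit of Figure 4.10, though the numbers will differ from the book's because the data-generating details here are assumed.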

These six examples illustrate that no one method will dominate the
others in every situation. When the true decision boundaries are linear, then
the LDA and logistic regression approaches will tend to perform well. When
the boundaries are moderately non-linear, QDA may give better results.
Finally, for much more complicated decision boundaries, a non-parametric
approach such as KNN can be superior. But the level of smoothness for a
non-parametric approach must be chosen carefully. In the next chapter we
examine a number of approaches for choosing the correct level of
smoothness and, in general, for selecting the best overall method.

Finally, recall from Chapter 3 that in the regression setting we can
accommodate a non-linear relationship between the predictors and the response
by performing regression using transformations of the predictors. A similar
approach could be taken in the classification setting. For instance, we could
create a more flexible version of logistic regression by including $X^2$, $X^3$,
and even $X^4$ as predictors. This may or may not improve logistic
regression’s performance, depending on whether the increase in variance due to
the added flexibility is offset by a sufficiently large reduction in bias. We
could do the same for LDA. If we added all possible quadratic terms and
cross-products to LDA, the form of the model would be the same as the
QDA model, although the parameter estimates would be different. This
device allows us to move somewhere between an LDA and a QDA model.
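
As a small illustration of adding transformed predictors to logistic regression, the sketch below fits a linear and a fourth-degree polynomial model with `glm()`. The data are simulated under the assumption that the true log odds are quadratic in a single predictor; the variable names are illustrative.

```r
set.seed(4)
# Simulated one-predictor data whose true log odds are quadratic in x (an assumption)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x^2))

fit_linear <- glm(y ~ x, family = binomial)
# More flexible fit: I() protects the powers so they are treated arithmetically
fit_poly   <- glm(y ~ x + I(x^2) + I(x^3) + I(x^4), family = binomial)

# Training error at the 0.5 threshold for each fit
train_err <- function(fit) mean((fitted(fit) > 0.5) != y)
c(linear = train_err(fit_linear), polynomial = train_err(fit_poly))
```

Training error alone cannot show whether the extra flexibility helps on new data; that comparison needs a held-out test set, for the reasons given in the discussion of bias and variance.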


4.6 Lab: Logistic Regression, LDA, QDA, and KNN

4.6.1 The Stock Market Data

We will begin by examining some numerical and graphical summaries of
the Smarket data, which is part of the ISLR library. This data set consists of
percentage returns for the S&P 500 stock index over 1,250 days, from the
beginning of 2001 until the end of 2005. For each date, we have recorded
the percentage returns for each of the five previous trading days, Lag1
through Lag5. We have also recorded Volume (the number of shares traded
on the previous day, in billions), Today (the percentage return on the date
in question) and Direction (whether the market was Up or Down on this
date).

> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9
> summary(Smarket)
      Year           Lag1                Lag2
 Min.   :2001   Min.   :-4.92200   Min.   :-4.92200
 1st Qu.:2002   1st Qu.:-0.63950   1st Qu.:-0.63950
 Median :2003   Median : 0.03900   Median : 0.03900
 Mean   :2003   Mean   : 0.00383   Mean   : 0.00392
 3rd Qu.:2004   3rd Qu.: 0.59675   3rd Qu.: 0.59675
 Max.   :2005   Max.   : 5.73300   Max.   : 5.73300
      Lag3                Lag4                Lag5
 Min.   :-4.92200   Min.   :-4.92200   Min.   :-4.92200
 1st Qu.:-0.64000   1st Qu.:-0.64000   1st Qu.:-0.64000
 Median : 0.03850   Median : 0.03850   Median : 0.03850
 Mean   : 0.00172   Mean   : 0.00164   Mean   : 0.00561
 3rd Qu.: 0.59675   3rd Qu.: 0.59675   3rd Qu.: 0.59700
 Max.   : 5.73300   Max.   : 5.73300   Max.   : 5.73300
     Volume           Today           Direction
 Min.   :0.356   Min.   :-4.92200   Down:602
 1st Qu.:1.257   1st Qu.:-0.63950   Up  :648
 Median :1.423   Median : 0.03850
 Mean   :1.478   Mean   : 0.00314
 3rd Qu.:1.642   3rd Qu.: 0.59675
 Max.   :3.152   Max.   : 5.73300
> pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise
correlations among the predictors in a data set. The first command below 
gives an error message because the Direction variable is qualitative. 


> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
         Year     Lag1     Lag2     Lag3     Lag4     Lag5
Year   1.0000  0.02970  0.03060  0.03319  0.03569  0.02979
Lag1   0.0297  1.00000 -0.02629 -0.01080 -0.00299 -0.00567
Lag2   0.0306 -0.02629  1.00000 -0.02590 -0.01085 -0.00356
Lag3   0.0332 -0.01080 -0.02590  1.00000 -0.02405 -0.01881
Lag4   0.0357 -0.00299 -0.01085 -0.02405  1.00000 -0.02708
Lag5   0.0298 -0.00567 -0.00356 -0.01881 -0.02708  1.00000
Volume 0.5390  0.04091 -0.04338 -0.04182 -0.04841 -0.02200
Today  0.0301 -0.02616 -0.01025 -0.00245 -0.00690 -0.03486
        Volume    Today
Year    0.5390  0.03010
Lag1    0.0409 -0.02616
Lag2   -0.0434 -0.01025
Lag3   -0.0418 -0.00245
Lag4   -0.0484 -0.00690
Lag5   -0.0220 -0.03486
Volume  1.0000  0.01459
Today   0.0146  1.00000

As one would expect, the correlations between the lag variables and 
today’s returns are close to zero. In other words, there appears to be little 
correlation between today’s returns and previous days’ returns. The only 
substantial correlation is between Year and Volume. By plotting the data we 
see that Volume is increasing over time. In other words, the average number 
of shares traded daily increased from 2001 to 2005. 

> attach(Smarket) 

> plot(Volume) 
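
Dropping the qualitative column by position, as in Smarket[,-9], works here but depends on the column order. As a minimal sketch, assuming a small made-up data frame (toy, which is not part of ISLR), the numeric columns can instead be selected by type before calling cor():

```r
# Toy data frame standing in for Smarket (values are made up for illustration).
toy <- data.frame(
  Lag1      = c(0.4, -1.0, 0.9, -0.6, 0.6),
  Volume    = c(1.2, 1.3, 1.4, 1.3, 1.2),
  Direction = factor(c("Up", "Up", "Down", "Up", "Down"))
)

# Keep only the numeric columns, whatever their position in the data frame.
numeric.cols <- sapply(toy, is.numeric)
toy.cor <- cor(toy[, numeric.cols])
print(round(toy.cor, 3))
```

The same pattern should apply to any data frame that mixes numeric and qualitative columns.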


4.6.2 Logistic Regression 

Next, we will fit a logistic regression model in order to predict Direction 
using Lag1 through Lag5 and Volume. The glm() function fits generalized 
linear models, a class of models that includes logistic regression. The syntax 
of the glm() function is similar to that of lm(), except that we must pass in 
the argument family=binomial in order to tell R to run a logistic regression 
rather than some other type of generalized linear model. 

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
    data=Smarket,family=binomial)
> summary(glm.fit)


Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5
    + Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
 -1.45   -1.20    1.07    1.15    1.33

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12600    0.24074   -0.52     0.60
Lag1        -0.07307    0.05017   -1.46     0.15
Lag2        -0.04230    0.05009   -0.84     0.40
Lag3         0.01109    0.04994    0.22     0.82
Lag4         0.00936    0.04997    0.19     0.85
Lag5         0.01031    0.04951    0.21     0.83
Volume       0.13544    0.15836    0.86     0.39

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1742

Number of Fisher Scoring iterations: 3

The smallest p-value here is associated with Lag1. The negative coefficient 
for this predictor suggests that if the market had a positive return yesterday, 
then it is less likely to go up today. However, at a value of 0.15, the p-value 
is still relatively large, and so there is no clear evidence of a real association 
between Lag1 and Direction. 
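
The z value and p-value in the coefficient table can be recovered from the estimate and its standard error. The following is a minimal sketch, assuming the usual Wald test with a standard normal reference distribution; the variable names are illustrative and the two numbers are copied from the output above:

```r
# Estimate and standard error for the first lag, copied from the summary output.
estimate  <- -0.07307
std.error <-  0.05017

# Wald z statistic and its two-sided p-value under a standard normal.
z.value <- estimate / std.error
p.value <- 2 * pnorm(-abs(z.value))

print(round(c(z = z.value, p = p.value), 3))
```

Up to rounding of the printed inputs, this should agree with the reported z value of about -1.46 and p-value of about 0.15.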

We use the coef() function in order to access just the coefficients for this 
fitted model. We can also use the summary() function to access particular 
aspects of the fitted model, such as the p-values for the coefficients. 

> coef(glm.fit)
(Intercept)        Lag1        Lag2        Lag3        Lag4
   -0.12600    -0.07307    -0.04230     0.01109     0.00936
       Lag5      Volume
    0.01031     0.13544
> summary(glm.fit)$coef
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12600     0.2407  -0.523    0.601
Lag1        -0.07307     0.0502  -1.457    0.145
Lag2        -0.04230     0.0501  -0.845    0.398
Lag3         0.01109     0.0499   0.222    0.824
Lag4         0.00936     0.0500   0.187    0.851
Lag5         0.01031     0.0495   0.208    0.835
Volume       0.13544     0.1584   0.855    0.392
> summary(glm.fit)$coef[,4]
(Intercept)        Lag1        Lag2        Lag3        Lag4
      0.601       0.145       0.398       0.824       0.851
       Lag5      Volume
      0.835       0.392

The predict() function can be used to predict the probability that the 
market will go up, given values of the predictors. The type="response" 
option tells R to output probabilities of the form P(Y = 1|X), as opposed 
to other information such as the logit. If no data set is supplied to the 
predict() function, then the probabilities are computed for the training 
data that was used to fit the logistic regression model. Here we have printed 
only the first ten probabilities. We know that these values correspond to 
the probability of the market going up, rather than down, because the 
contrasts() function indicates that R has created a dummy variable with 
a 1 for Up. 

> glm.probs=predict(glm.fit,type="response")
> glm.probs[1:10]
    1     2     3     4     5     6     7     8     9    10
0.507 0.481 0.481 0.515 0.511 0.507 0.493 0.509 0.518 0.489
> contrasts(Direction)
     Up
Down  0
Up    1
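
To make the difference between the logit and the probability scale concrete, here is a minimal sketch on simulated data (the data, the seed, and the names sim.fit, logit, and prob are assumptions for illustration, not the Smarket fit). It checks that the "response" predictions are the logistic transform of the "link" predictions:

```r
set.seed(1)
# Simulated predictor and two-level response (not the Smarket data).
x <- rnorm(200)
y <- factor(ifelse(runif(200) < plogis(0.5 * x), "Up", "Down"))
sim.fit <- glm(y ~ x, family = binomial)

# type = "link" returns the logit; type = "response" returns P(Y = 1 | X).
logit <- predict(sim.fit, type = "link")
prob  <- predict(sim.fit, type = "response")

# The two are related through the logistic function, plogis().
print(all.equal(prob, plogis(logit)))
```

Because the factor levels are ordered alphabetically, Down is coded 0 and Up is coded 1 here as well, so prob is the estimated probability of Up.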

In order to make a prediction as to whether the market will go up or 
down on a particular day, we must convert these predicted probabilities 
into class labels, Up or Down. The following two commands create a vector 
of class predictions based on whether the predicted probability of a market 
increase is greater than or less than 0.5. 

> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"

The first command creates a vector of 1,250 Down elements. The second line 
transforms to Up all of the elements for which the predicted probability of a 
market increase exceeds 0.5. Given these predictions, the table() function 
can be used to produce a confusion matrix in order to determine how many 
observations were correctly or incorrectly classified. 

> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
> (507+145)/1250
[1] 0.5216
> mean(glm.pred==Direction)
[1] 0.5216

The diagonal elements of the confusion matrix indicate correct predictions, 
while the off-diagonals represent incorrect predictions. Hence our model 
correctly predicted that the market would go up on 507 days and that 
it would go down on 145 days, for a total of 507 + 145 = 652 correct 
predictions. The mean() function can be used to compute the fraction of 
days for which the prediction was correct. In this case, logistic regression 
correctly predicted the movement of the market 52.2% of the time. 
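
The arithmetic above can also be read directly off the confusion matrix. A minimal sketch, with the four counts copied from the table above and illustrative object names:

```r
# Confusion matrix copied from the output above:
# rows are predictions, columns are the true Direction.
conf.mat <- matrix(c(145, 457, 141, 507), nrow = 2,
                   dimnames = list(glm.pred  = c("Down", "Up"),
                                   Direction = c("Down", "Up")))

correct  <- sum(diag(conf.mat))      # 145 + 507 correct predictions
accuracy <- correct / sum(conf.mat)  # fraction classified correctly
print(c(correct = correct, accuracy = accuracy, error = 1 - accuracy))
```

The diagonal sum divided by the total gives the same 0.5216 as the mean() call, and one minus that value is the training error rate.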

At first glance, it appears that the logistic regression model is working 
a little better than random guessing. However, this result is misleading 
because we trained and tested the model on the same set of 1,250 
observations. In other words, 100 - 52.2 = 47.8% is the training error rate. 
As we have seen previously, the training error rate is often overly 
optimistic: it tends to underestimate the test error rate. In order to better 
assess the accuracy of the logistic regression model in this setting, we can 
fit the model using part of the data, and then examine how well it predicts 
the held out data. This will yield a more realistic error rate, in the sense 
that in practice we will be interested in our model’s performance not on the 
data that we used to fit the model, but rather on days in the future for 
which the market’s movements are unknown. 


To implement this strategy, we will first create a vector corresponding 
to the observations from 2001 through 2004. We will then use this vector 
to create a held out data set of observations from 2005. 

> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> dim(Smarket.2005)
[1] 252   9
> Direction.2005=Direction[!train]

The object train is a vector of 1,250 elements, corresponding to the 
observations in our data set. The elements of the vector that correspond to 
observations that occurred before 2005 are set to TRUE, whereas those that 
correspond to observations in 2005 are set to FALSE. The object train is 
a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors 
can be used to obtain a subset of the rows or columns of a matrix. For 
instance, the command Smarket[train,] would pick out a submatrix of the 
stock market data set, corresponding only to the dates before 2005, since 
those are the ones for which the elements of train are TRUE. The ! symbol 
can be used to reverse all of the elements of a Boolean vector. That is, 
!train is a vector similar to train, except that the elements that are TRUE 
in train get swapped to FALSE in !train, and the elements that are FALSE 
in train get swapped to TRUE in !train. Therefore, Smarket[!train,] yields 
a submatrix of the stock market data containing only the observations for 
which train is FALSE, that is, the observations with dates in 2005. The 
output above indicates that there are 252 such observations. 
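
The Boolean indexing described above can be tried on a much smaller example. This is a minimal sketch using a made-up five-row data frame (toy and its values are assumptions for illustration), and it refers to the column as toy$Year rather than relying on attach():

```r
# Toy data frame (made-up values) with a Year column like Smarket's.
toy <- data.frame(Year = c(2003, 2004, 2004, 2005, 2005),
                  Lag1 = c(0.4, -1.0, 0.9, -0.6, 0.6))

train <- toy$Year < 2005     # Boolean vector: TRUE for rows before 2005
toy.train <- toy[train, ]    # rows for which train is TRUE
toy.test  <- toy[!train, ]   # rows for which train is FALSE

print(train)
print(c(n.train = nrow(toy.train), n.test = nrow(toy.test)))
```

The comma after the index matters: toy[train, ] selects rows, whereas toy[, train] would attempt to select columns.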

We now fit a logistic regression model using only the subset of the 
observations that correspond to dates before 2005, using the subset argument. 
We then obtain predicted probabilities of the stock market going up for 
each of the days in our test set, that is, for the days in 2005. 

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
    data=Smarket,family=binomial,subset=train)
> glm.probs=predict(glm.fit,Smarket.2005,type="response")

Notice that we have trained and tested our model on two completely 
separate data sets: training was performed using only the dates before 2005, 
and testing was performed using only the dates in 2005. Finally, we 
compute the predictions for 2005 and compare them to the actual movements 
of the market over that time period. 

> glm.pred=rep("Down",252)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
> mean(glm.pred==Direction.2005)
[1] 0.48
> mean(glm.pred!=Direction.2005)
[1] 0.52

The != notation means not equal to, and so the last command computes 
the test set error rate. The results are rather disappointing: the test error 
rate is 52%, which is worse than random guessing! Of course this result 
is not all that surprising, given that one would not generally expect to be 
able to use previous days’ returns to predict future market performance. 
(After all, if it were possible to do so, then the authors of this book would 
be out striking it rich rather than writing a statistics textbook.) 

We recall that the logistic regression model had very underwhelming 
p-values associated with all of the predictors, and that the smallest p-value, 
though not very small, corresponded to Lag1. Perhaps by removing the 
variables that appear not to be helpful in predicting Direction, we can 
obtain a more effective model. After all, using predictors that have no 
relationship with the response tends to cause a deterioration in the test 
error rate (since such predictors cause an increase in variance without a 
corresponding decrease in bias), and so removing such predictors may in 
turn yield an improvement. Below we have refit the logistic regression using 
just Lag1 and Lag2, which seemed to have the highest predictive power in 
the original logistic regression model. 

> glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial,
    subset=train)
> glm.probs=predict(glm.fit,Smarket.2005,type="response")
> glm.pred=rep("Down",252)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
> mean(glm.pred==Direction.2005)
[1] 0.56
> 106/(106+76)
[1] 0.582

Now the results appear to be a little better: 56% of the daily movements 
have been correctly predicted. It is worth noting that in this case, a much 
simpler strategy of predicting that the market will increase every day will 
also be correct 56% of the time! Hence, in terms of overall error rate, the 
logistic regression method is no better than the naive approach. However, 
the confusion matrix shows that on days when logistic regression predicts 
an increase in the market, it has a 58% accuracy rate. This suggests a 
possible trading strategy of buying on days when the model predicts an 
increasing market, and avoiding trades on days when a decrease is predicted. 
Of course one would need to investigate more carefully whether this small 
improvement was real or just due to random chance. 
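
The three rates compared in this paragraph can be computed side by side from the 2005 confusion matrix. A minimal sketch, with the counts copied from the table above; the object names are illustrative, and the final binomial test is only one rough way to look at chance variation (it treats the days as independent, which is an assumption):

```r
# Test-set confusion matrix for the two-predictor model, copied from above.
conf.mat <- matrix(c(35, 76, 35, 106), nrow = 2,
                   dimnames = list(glm.pred       = c("Down", "Up"),
                                   Direction.2005 = c("Down", "Up")))

accuracy     <- sum(diag(conf.mat)) / sum(conf.mat)           # overall accuracy
always.up    <- sum(conf.mat[, "Up"]) / sum(conf.mat)         # naive "always Up" rule
up.precision <- conf.mat["Up", "Up"] / sum(conf.mat["Up", ])  # accuracy on predicted-Up days

print(round(c(accuracy = accuracy, always.up = always.up,
              up.precision = up.precision), 3))

# Rough check: is 106 out of 182 distinguishable from the naive rate?
print(binom.test(106, 182, p = always.up)$p.value)
```

The first two rates coincide, which is the point made in the text; only the rate on predicted-Up days is higher.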

Suppose that we want to predict the returns associated with particular 
values of Lag1 and Lag2. In particular, we want to predict Direction on a 
day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when 
they equal 1.5 and -0.8. We do this using the predict() function. 

> predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),
    Lag2=c(1.1,-0.8)),type="response")
     1      2
0.4791 0.4961


4.6.3 Linear Discriminant Analysis 

Now we will perform LDA on the Smarket data. In R, we fit an LDA model 
using the lda() function, which is part of the MASS library. Notice that the 
syntax for the lda() function is identical to that of lm(), and to that of 
glm() except for the absence of the family option. We fit the model using 
only the observations before 2005. 

> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up
0.492 0.508

Group means:
        Lag1    Lag2
Down  0.0428  0.0339
Up   -0.0395 -0.0313

Coefficients of linear discriminants:
        LD1
Lag1 -0.642
Lag2 -0.514
> plot(lda.fit)

The LDA output indicates that $\hat{\pi}_1 = 0.492$ and $\hat{\pi}_2 = 0.508$; 
in other words, 49.2% of the training observations correspond to days during 
which the market went down. It also provides the group means; these are the 
average of each predictor within each class, and are used by LDA as estimates 
of $\mu_k$. These suggest that there is a tendency for the previous 2 days’ 
returns to be negative on days when the market increases, and a tendency 
for the previous days’ returns to be positive on days when the market 
declines. The coefficients of linear discriminants output provides the linear 
combination of Lag1 and Lag2 that is used to form the LDA decision rule. 
In other words, these are the multipliers of the elements of $X = x$ in 
(4.19). If $-0.642 \times \text{Lag1} - 0.514 \times \text{Lag2}$ is large, then 
the LDA classifier will predict a market increase, and if it is small, then 
the LDA classifier will predict a market decline. The plot() function produces 
plots of the linear discriminants, obtained by computing 
$-0.642 \times \text{Lag1} - 0.514 \times \text{Lag2}$ for each of the training 
observations. 
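
As a minimal sketch of that linear combination, the reported coefficients can be applied by hand to a couple of hypothetical days (the two rows of new.days are made-up values, and all object names are illustrative):

```r
# Coefficients of the linear discriminant, copied from the lda() output.
ld1 <- c(Lag1 = -0.642, Lag2 = -0.514)

# Two hypothetical days (values chosen for illustration only).
new.days <- rbind(day1 = c(Lag1 = 1.2, Lag2 = 1.1),
                  day2 = c(Lag1 = 1.5, Lag2 = -0.8))

# Raw linear combination -0.642 * Lag1 - 0.514 * Lag2 for each day.
scores <- drop(new.days %*% ld1)
print(scores)
```

Note that the discriminant values reported by the fitted model may differ from this raw combination by a constant shift, since the predictors are typically centered before the coefficients are applied; the ordering of the days is unaffected.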

The predict() function returns a list with three elements. The first 
element, class, contains LDA’s predictions about the movement of the market. 
The second element, posterior, is a matrix whose kth column contains the 
posterior probability that the corresponding observation belongs to the kth 
class, computed from (4.10). Finally, x contains the linear discriminants, 
described earlier. 

> lda.pred=predict(lda.fit,Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"

As we observed in Section 4.5, the LDA and logistic regression predictions 
are almost identical. 

> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106
> mean(lda.class==Direction.2005)
[1] 0.56

Applying a 50% threshold to the posterior probabilities allows us to 
recreate the predictions contained in lda.pred$class. 

> sum(lda.pred$posterior[,1]>=.5)
[1] 70
> sum(lda.pred$posterior[,1]<.5)
[1] 182

Notice that the posterior probability output by the model corresponds to 
the probability that the market will decrease: 

> lda.pred$posterior[1:20,1]
> lda.class[1:20]

If we wanted to use a posterior probability threshold other than 50% in 
order to make predictions, then we could easily do so. For instance, suppose 
that we wish to predict a market decrease only if we are very certain that the 
market will indeed decrease on that day, say, if the posterior probability 
is at least 90%. 

> sum(lda.pred$posterior[,1]>.9)
[1] 0

No days in 2005 meet that threshold! In fact, the greatest posterior 
probability of decrease in all of 2005 was 52.02%. 
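
Changing the threshold amounts to a one-line rule on the posterior probabilities. A minimal sketch, using made-up posterior values and a hypothetical helper named classify() (neither comes from the fitted model):

```r
# Hypothetical posterior probabilities of "Down" (made-up values).
post.down <- c(0.49, 0.52, 0.48, 0.51, 0.47)

# Classify as "Down" only when the posterior reaches the chosen threshold.
classify <- function(p.down, threshold = 0.5) {
  ifelse(p.down >= threshold, "Down", "Up")
}

print(table(classify(post.down)))                   # default 50% threshold
print(table(classify(post.down, threshold = 0.9)))  # stricter 90% rule
```

With posteriors this close to one half, the stricter rule never predicts a decrease, which mirrors what happens with the 2005 data.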


4.6.4 Quadratic Discriminant Analysis 

We will now fit a QDA model to the Smarket data. QDA is implemented 
in R using the qda() function, which is also part of the MASS library. The 
syntax is identical to that of lda(). 

> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up
0.492 0.508

Group means:
        Lag1    Lag2
Down  0.0428  0.0339
Up   -0.0395 -0.0313

The output contains the group means. But it does not contain the 
coefficients of the linear discriminants, because the QDA classifier involves a 
quadratic, rather than a linear, function of the predictors. The predict() 
function works in exactly the same fashion as for LDA. 

> qda.class=predict(qda.fit,Smarket.2005)$class
> table(qda.class,Direction.2005)
         Direction.2005
qda.class Down  Up
     Down   30  20
     Up     81 121
> mean(qda.class==Direction.2005)
[1] 0.599

Interestingly, the QDA predictions are accurate almost 60% of the time, 
even though the 2005 data was not used to fit the model. This level of 
accuracy is quite impressive for stock market data, which is known to be quite 
hard to model accurately. This suggests that the quadratic form assumed 
by QDA may capture the true relationship more accurately than the linear 
forms assumed by LDA and logistic regression. However, we recommend 
evaluating this method’s performance on a larger test set before betting 
that this approach will consistently beat the market! 


4.6.5 K-Nearest Neighbors 

We will now perform KNN using the knn() function, which is part of the 
class library. This function works rather differently from the other 
model-fitting functions that we have encountered thus far. Rather than a 
two-step approach in which we first fit the model and then we use the model 
to make predictions, knn() forms predictions using a single command. The 
function requires four inputs. 


1. A matrix containing the predictors associated with the training data, 
labeled train.X below. 

2. A matrix containing the predictors associated with the data for which 
we wish to make predictions, labeled test.X below. 

3. A vector containing the class labels for the training observations, 
labeled train.Direction below. 

4. A value for K, the number of nearest neighbors to be used by the 
classifier. 

We use the cbind() function, short for column bind, to bind the Lag1 and 
Lag2 variables together into two matrices, one for the training set and the 
other for the test set. 

> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]

Now the knn() function can be used to predict the market’s movement for 
the dates in 2005. We set a random seed before we apply knn() because 
if several observations are tied as nearest neighbors, then R will randomly 
break the tie. Therefore, a seed must be set in order to ensure 
reproducibility of results. 

> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Direction,k=1)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   43 58
    Up     68 83
> (83+43)/252
[1] 0.5

The results using K = 1 are not very good, since only 50% of the 
observations are correctly predicted. Of course, it may be that K = 1 results 
in an overly flexible fit to the data. Below, we repeat the analysis using 
K = 3. 

> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87
> mean(knn.pred==Direction.2005)
[1] 0.536

The results have improved slightly. But increasing K further turns out 
to provide no further improvements. It appears that for this data, QDA 
provides the best results of the methods that we have examined so far. 
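
To show what a single-command nearest-neighbor prediction does internally, here is a minimal base-R sketch of the K = 1 case. It is not the implementation used by the class library: the helper nn1() and the tiny data set are assumptions for illustration, and ties are resolved by taking the first minimum rather than at random.

```r
# Made-up training points, labels, and test points (illustration only).
train.X <- rbind(c(0, 0), c(1, 1), c(4, 4), c(5, 5))
train.y <- factor(c("Down", "Down", "Up", "Up"))
test.X  <- rbind(c(0.5, 0.2), c(4.6, 4.9))

# For each test row, find the closest training row in squared Euclidean
# distance and copy its label.
nn1 <- function(train.X, test.X, labels) {
  nearest <- apply(test.X, 1, function(x) {
    which.min(colSums((t(train.X) - x)^2))
  })
  labels[nearest]
}

pred <- nn1(train.X, test.X, train.y)
print(pred)
```

The four arguments mirror the inputs listed earlier, with K fixed at 1: training predictors, test predictors, and training labels.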


4.6.6 An Application to Caravan Insurance Data 

Finally, we will apply the KNN approach to the Caravan data set, which is 
part of the ISLR library. This data set includes 85 predictors that measure 
demographic characteristics for 5,822 individuals. The response variable is 
Purchase, which indicates whether or not a given individual purchases a 
caravan insurance policy. In this data set, only 6% of people purchased 
caravan insurance. 

> dim(Caravan)
[1] 5822   86
> attach(Caravan)
> summary(Purchase)
  No  Yes
5474  348
> 348/5822
[1] 0.0598

Because the KNN classifier predicts the class of a given test observation by 
identifying the observations that are nearest to it, the scale of the variables 
matters. Any variables that are on a large scale will have a much larger 
effect on the distance between the observations, and hence on the KNN 
classifier, than variables that are on a small scale. For instance, imagine a 
data set that contains two variables, salary and age (measured in dollars 
and years, respectively). As far as KNN is concerned, a difference of $1,000 
in salary is enormous compared to a difference of 50 years in age. 
Consequently, salary will drive the KNN classification results, and age will 
have almost no effect. This is contrary to our intuition that a salary 
difference of $1,000 is quite small compared to an age difference of 50 years. 
Furthermore, the importance of scale to the KNN classifier leads to another 
issue: if we measured salary in Japanese yen, or if we measured age in minutes, 
then we’d get quite different classification results from what we get if these 
two variables are measured in dollars and years. 

A good way to handle this problem is to standardize the data so that all 
variables are given a mean of zero and a standard deviation of one. Then 
all variables will be on a comparable scale. The scale() function does just 
this. In standardizing the data, we exclude column 86, because that is the 
qualitative Purchase variable. 

> standardized.X=scale(Caravan[,-86])
> var(Caravan[,1])
[1] 165
> var(Caravan[,2])
[1] 0.165
> var(standardized.X[,1])
[1] 1
> var(standardized.X[,2])
[1] 1

Now every column of standardized.X has a standard deviation of one and 
a mean of zero. 
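
The salary and age example can be checked numerically. This is a minimal sketch with three made-up people (the values and object names are assumptions for illustration); it compares distances before and after standardizing and confirms that standardizing by hand agrees with scale():

```r
# Made-up salaries (dollars) and ages (years) for three people.
people <- cbind(salary = c(50000, 51000, 50000),
                age    = c(25, 25, 75))

# Raw distances: the $1,000 salary gap (rows 1 and 2) swamps
# the 50-year age gap (rows 1 and 3).
raw.d <- as.matrix(dist(people))

# Standardize by hand: subtract column means, divide by column SDs.
centered <- sweep(people, 2, colMeans(people))
manual   <- sweep(centered, 2, apply(people, 2, sd), "/")
same.as.scale <- all.equal(manual, scale(people), check.attributes = FALSE)

# After standardizing, the two gaps are on a comparable footing.
std.d <- as.matrix(dist(scale(people)))
print(same.as.scale)
print(round(c(raw.salary.gap = raw.d[1, 2], raw.age.gap = raw.d[1, 3],
              std.salary.gap = std.d[1, 2], std.age.gap = std.d[1, 3]), 3))
```

On the raw scale the salary gap dominates the distance, whereas after standardizing the two gaps contribute equally in this toy example.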


We now split the observations into a test set, containing the first 1,000 
observations, and a training set, containing the remaining observations. 
We fit a KNN model on the training data using K = 1, and evaluate its 
performance on the test data. 

> test=1:1000
> train.X=standardized.X[-test,]
> test.X=standardized.X[test,]
> train.Y=Purchase[-test]
> test.Y=Purchase[test]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Y,k=1)
> mean(test.Y!=knn.pred)
[1] 0.118
> mean(test.Y!="No")
[1] 0.059

The vector test is numeric, with values from 1 through 1,000. Typing 
standardized.X[test,] yields the submatrix of the data containing the 
observations whose indices range from 1 to 1,000, whereas typing 
standardized.X[-test,] yields the submatrix containing the observations 
whose indices do not range from 1 to 1,000. The KNN error rate on the 
1,000 test observations is just under 12%. At first glance, this may 
appear to be fairly good. However, since only 6% of customers purchased 
insurance, we could get the error rate down to 6% by always predicting No 
regardless of the values of the predictors! 

Suppose that there is some non-trivial cost to trying to sell insurance 
to a given individual. For instance, perhaps a salesperson must visit each 
potential customer. If the company tries to sell insurance to a random 
selection of customers, then the success rate will be only 6%, which may 
be far too low given the costs involved. Instead, the company would like 
to try to sell insurance only to customers who are likely to buy it. So the 
overall error rate is not of interest. Instead, the fraction of individuals that 
are correctly predicted to buy insurance is of interest. 

It turns out that KNN with K = 1 does far better than random guessing 
among the customers that are predicted to buy insurance. Among 77 such 
customers, 9, or 11.7%, actually do purchase insurance. This is double the 
rate that one would obtain from random guessing. 

> table(knn.pred,test.Y)
        test.Y
knn.pred  No Yes
     No  873  50
     Yes  68   9
> 9/(68+9)
[1] 0.117

Using K = 3, the success rate increases to 19%, and with K = 5 the rate is 
26.7%. This is over four times the rate that results from random guessing. 
It appears that KNN is finding some real patterns in a difficult data set! 

> knn.pred=knn(train.X,test.X,train.Y,k=3)
> table(knn.pred,test.Y)
        test.Y
knn.pred  No Yes
     No  920  54
     Yes  21   5
> 5/26
[1] 0.192
> knn.pred=knn(train.X,test.X,train.Y,k=5)
> table(knn.pred,test.Y)
        test.Y
knn.pred  No Yes
     No  930  55
     Yes  11   4
> 4/15
[1] 0.267
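
The success rates for the three values of K can be tabulated together and compared with the base rate of purchases in the test set. A minimal sketch, with all counts copied from the confusion matrices above; the column named lift (success rate divided by base rate) is a label introduced here for illustration:

```r
# Predicted "Yes" counts copied from the three confusion matrices above.
k         <- c(1, 3, 5)
true.yes  <- c(9, 5, 4)     # predicted Yes and actually purchased
false.yes <- c(68, 21, 11)  # predicted Yes but did not purchase

precision <- true.yes / (true.yes + false.yes)
base.rate <- 59 / 1000      # share of purchasers among the 1,000 test rows
print(round(data.frame(k, precision, lift = precision / base.rate), 3))
```

Note that as K grows the success rate rises, but the number of customers flagged falls from 77 to 15, so the higher rates rest on very few predictions.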

As a comparison, we can also fit a logistic regression model to the data. 
If we use 0.5 as the predicted probability cut-off for the classifier, then 
we have a problem: only seven of the test observations are predicted to 
purchase insurance. Even worse, we are wrong about all of these! However, 
we are not required to use a cut-off of 0.5. If we instead predict a purchase 
any time the predicted probability of purchase exceeds 0.25, we get much 
better results: we predict that 33 people will purchase insurance, and we 
are correct for about 33% of these people. This is over five times better 
than random guessing! 

> glm.fit=glm(Purchase~.,data=Caravan,family=binomial,
    subset=-test)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> glm.probs=predict(glm.fit,Caravan[test,],type="response")
> glm.pred=rep("No",1000)
> glm.pred[glm.probs>.5]="Yes"
> table(glm.pred,test.Y)
        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0
> glm.pred=rep("No",1000)
> glm.pred[glm.probs>.25]="Yes"
> table(glm.pred,test.Y)
        test.Y
glm.pred  No Yes
     No  919  48
     Yes  22  11
> 11/(22+11)
[1] 0.333


4.7 Exercises 

Conceptual 

1. Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In 
other words, the logistic function representation and logit 
representation for the logistic regression model are equivalent. 

2. It was stated in the text that classifying an observation to the class 
for which (4.12) is largest is equivalent to classifying an observation 
to the class for which (4.13) is largest. Prove that this is the case. In 
other words, under the assumption that the observations in the kth 
class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes’ classifier 
assigns an observation to the class for which the discriminant function 
is maximized. 

3. This problem relates to the QDA model, in which the observations 
within each class are drawn from a normal distribution with a 
class-specific mean vector and a class-specific covariance matrix. We 
consider the simple case where p = 1; i.e. there is only one feature. 

Suppose that we have K classes, and that if an observation belongs 
to the kth class then X comes from a one-dimensional normal 
distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function 
for the one-dimensional normal distribution is given in (4.11). Prove that in 
this case, the Bayes’ classifier is not linear. Argue that it is in fact 
quadratic. 

Hint: For this problem, you should follow the arguments laid out in 
Section 4.4.2, but without making the assumption that 
$\sigma_1^2 = \dots = \sigma_K^2$. 

4. When the number of features p is large, there tends to be a 
deterioration in the performance of KNN and other local approaches that 
perform prediction using only observations that are near the test 
observation for which a prediction must be made. This phenomenon is 
known as the curse of dimensionality, and it ties into the fact that 
non-parametric approaches often perform poorly when p is large. We 
will now investigate this curse. 

(a) Suppose that we have a set of observations, each with 
measurements on p = 1 feature, X. We assume that X is uniformly 
(evenly) distributed on [0,1]. Associated with each observation 
is a response value. Suppose that we wish to predict a test 
observation’s response using only observations that are within 10% of 
the range of X closest to that test observation. For instance, in 
order to predict the response for a test observation with X = 0.6, 
we will use observations in the range [0.55, 0.65]. On average, 
what fraction of the available observations will we use to make 
the prediction? 

(b) Now suppose that we have a set of observations, each with 
measurements on p = 2 features, $X_1$ and $X_2$. We assume that 
$(X_1, X_2)$ are uniformly distributed on $[0,1] \times [0,1]$. We wish to 
predict a test observation’s response using only observations that 
are within 10% of the range of $X_1$ and within 10% of the range 
of $X_2$ closest to that test observation. For instance, in order to 
predict the response for a test observation with $X_1 = 0.6$ and 
$X_2 = 0.35$, we will use observations in the range [0.55, 0.65] for 
$X_1$ and in the range [0.3, 0.4] for $X_2$. On average, what fraction 
of the available observations will we use to make the prediction? 

(c) Now suppose that we have a set of observations on p = 100 
features. Again the observations are uniformly distributed on each 
feature, and again each feature ranges in value from 0 to 1. We 
wish to predict a test observation’s response using observations 
within the 10% of each feature’s range that is closest to that test 
observation. What fraction of the available observations will we 
use to make the prediction? 

(d) Using your answers to parts (a)-(c), argue that a drawback of 
KNN when p is large is that there are very few training 
observations “near” any given test observation. 

(e) Now suppose that we wish to make a prediction for a test 
observation by creating a p-dimensional hypercube centered around 
the test observation that contains, on average, 10% of the 
training observations. For p = 1, 2, and 100, what is the length of 
each side of the hypercube? Comment on your answer. 

Note: A hypercube is a generalization of a cube to an arbitrary 
number of dimensions. When p = 1, a hypercube is simply a line 
segment, when p = 2 it is a square, and when p = 100 it is a 
100-dimensional cube. 

5. We now examine the differences between LDA and QDA. 

(a) If the Bayes decision boundary is linear, do we expect LDA or 
QDA to perform better on the training set? On the test set? 

(b) If the Bayes decision boundary is non-linear, do we expect LDA 
or QDA to perform better on the training set? On the test set? 

(c) In general, as the sample size n increases, do we expect the test 
prediction accuracy of QDA relative to LDA to improve, decline, 
or be unchanged? Why? 

(d) True or False: Even if the Bayes decision boundary for a given 
problem is linear, we will probably achieve a superior test 
error rate using QDA rather than LDA because QDA is flexible 
enough to model a linear decision boundary. Justify your answer. 

6. Suppose we collect data for a group of students in a statistics class
with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and Y =
receive an A. We fit a logistic regression and produce estimated
coefficients, $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.

(a) Estimate the probability that a student who studies for 40 h and 
has an undergrad GPA of 3.5 gets an A in the class. 

(b) How many hours would the student in part (a) need to study to 
have a 50 % chance of getting an A in the class? 

7. Suppose that we wish to predict whether a given stock will issue a 
dividend this year (“Yes” or “No”) based on X, last year’s percent 
profit. We examine a large number of companies and discover that the 
mean value of X for companies that issued a dividend was $\bar{X} = 10$,
while the mean for those that didn’t was $\bar{X} = 0$. In addition, the
variance of X for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally,
80% of companies issued dividends. Assuming that X follows a normal
distribution, predict the probability that a company will issue
a dividend this year given that its percentage profit was X = 4 last 
year. 



8. Suppose that we take a data set, divide it into equally-sized training 
and test sets, and then try out two different classification procedures. 
First we use logistic regression and get an error rate of 20% on the
training data and 30% on the test data. Next we use 1-nearest neighbors
(i.e. K = 1) and get an average error rate (averaged over both
test and training data sets) of 18%. Based on these results, which
method should we prefer to use for classification of new observations? 
Why? 

9. This problem has to do with odds. 

(a) On average, what fraction of people with an odds of 0.37 of 
defaulting on their credit card payment will in fact default? 

(b) Suppose that an individual has a 16% chance of defaulting on 
her credit card payment. What are the odds that she will default?

Applied

10. This question should be answered using the Weekly data set, which 
is part of the ISLR package. This data is similar in nature to the 
Smarket data from this chapter’s lab, except that it contains 1,089 
weekly returns for 21 years, from the beginning of 1990 to the end of 
2010.

(a) Produce some numerical and graphical summaries of the Weekly 
data. Do there appear to be any patterns? 

(b) Use the full data set to perform a logistic regression with 
Direction as the response and the five lag variables plus Volume 
as predictors. Use the summary function to print the results. Do 
any of the predictors appear to be statistically significant? If so, 
which ones? 

(c) Compute the confusion matrix and overall fraction of correct 
predictions. Explain what the confusion matrix is telling you 
about the types of mistakes made by logistic regression. 

(d) Now fit the logistic regression model using a training data period 
from 1990 to 2008, with Lag2 as the only predictor. Compute the 
confusion matrix and the overall fraction of correct predictions 
for the held out data (that is, the data from 2009 and 2010). 

(e) Repeat (d) using LDA. 

(f) Repeat (d) using QDA. 

(g) Repeat (d) using KNN with K = 1. 

(h) Which of these methods appears to provide the best results on 
this data? 

(i) Experiment with different combinations of predictors, including
possible transformations and interactions, for each of the
methods. Report the variables, method, and associated confusion
matrix that appears to provide the best results on the held
out data. Note that you should also experiment with values for 
K in the KNN classifier. 

11. In this problem, you will develop a model to predict whether a given 
car gets high or low gas mileage based on the Auto data set. 

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains
a value above its median, and a 0 if mpg contains a value below
its median. You can compute the median using the median()
function. Note you may find it helpful to use the data.frame()
function to create a single data set containing both mpg01 and
the other Auto variables.

(b) Explore the data graphically in order to investigate the association
between mpg01 and the other features. Which of the other
features seem most likely to be useful in predicting mpg01? Scatterplots
and boxplots may be useful tools to answer this question.
Describe your findings.

(c) Split the data into a training set and a test set. 

(d) Perform LDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?

(e) Perform QDA on the training data in order to predict mpg01
using the variables that seemed most associated with mpg01 in
(b). What is the test error of the model obtained?

(f) Perform logistic regression on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model obtained?

(g) Perform KNN on the training data, with several values of K, in
order to predict mpg01. Use only the variables that seemed most
associated with mpg01 in (b). What test errors do you obtain?
Which value of K seems to perform the best on this data set? 

12. This problem involves writing functions. 

(a) Write a function, Power(), that prints out the result of raising 2
to the 3rd power. In other words, your function should compute
2^3 and print out the results.

Hint: Recall that x^a raises x to the power a. Use the print()
function to output the result.

(b) Create a new function, Power2(), that allows you to pass any
two numbers, x and a, and prints out the value of x^a. You can
do this by beginning your function with the line

> Power2=function(x,a){ 

You should be able to call your function by entering, for instance, 

> Power2(3,8) 

on the command line. This should output the value of 3^8, namely,
6,561.

(c) Using the Power2() function that you just wrote, compute 10^3,
8^17, and 131^3.

(d) Now create a new function, Power3(), that actually returns the
result x^a as an R object, rather than simply printing it to the
screen. That is, if you store the value x^a in an object called
result within your function, then you can simply return() this
result, using the following line:

return(result)

The line above should be the last line in your function, before
the } symbol.

(e) Now using the Power3() function, create a plot of f(x) = x^2.
The x-axis should display a range of integers from 1 to 10, and
the y-axis should display x^2. Label the axes appropriately, and
use an appropriate title for the figure. Consider displaying either
the x-axis, the y-axis, or both on the log-scale. You can do this
by using log="x", log="y", or log="xy" as arguments to
the plot() function.

(f) Create a function, PlotPower(), that allows you to create a plot
of x against x^a for a fixed a and for a range of values of x. For
instance, if you call

> PlotPower(1:10, 3)

then a plot should be created with an x-axis taking on values
1, 2, ..., 10, and a y-axis taking on values 1^3, 2^3, ..., 10^3.

13. Using the Boston data set, fit classification models in order to predict 
whether a given suburb has a crime rate above or below the median. 
Explore logistic regression, LDA, and KNN models using various subsets
of the predictors. Describe your findings.

Resampling Methods


Resampling methods are an indispensable tool in modern statistics. They
involve repeatedly drawing samples from a training set and refitting a model 
of interest on each sample in order to obtain additional information about 
the fitted model. For example, in order to estimate the variability of a linear 
regression fit, we can repeatedly draw different samples from the training 
data, fit a linear regression to each new sample, and then examine the 
extent to which the resulting fits differ. Such an approach may allow us to 
obtain information that would not be available from fitting the model only 
once using the original training sample. 
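
As a rough illustration of this idea, the following R sketch refits a simple linear regression on repeated random subsamples and looks at how much the slope estimate varies. It assumes the built-in mtcars data as a stand-in for a training set; the 200 draws and the half-size subsamples are arbitrary choices, not taken from the text.

```r
# Minimal sketch: gauge the variability of a linear regression fit by
# refitting it on repeated random subsamples of the training data.
# Assumption: mtcars (built into R) stands in for a generic training set.
set.seed(1)
n <- nrow(mtcars)
n_draws <- 200  # number of resamples (arbitrary choice)

slopes <- replicate(n_draws, {
  idx <- sample(n, size = n %/% 2)          # a different sample each time
  fit <- lm(mpg ~ hp, data = mtcars[idx, ]) # refit on that sample
  unname(coef(fit)["hp"])                   # keep the slope estimate
})

# Spread of the slope estimate across the refits
sd(slopes)
```

A single fit on the full data would give one slope and no direct picture of this spread.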

Resampling approaches can be computationally expensive, because they 
involve fitting the same statistical method multiple times using different 
subsets of the training data. However, due to recent advances in computing 
power, the computational requirements of resampling methods generally 
are not prohibitive. In this chapter, we discuss two of the most commonly 
used resampling methods, cross-validation and the bootstrap. Both methods 
are important tools in the practical application of many statistical learning 
procedures. For example, cross-validation can be used to estimate the test 
error associated with a given statistical learning method in order to evaluate 
its performance, or to select the appropriate level of flexibility. The process 
of evaluating a model’s performance is known as model assessment, whereas 
the process of selecting the proper level of flexibility for a model is known as 
model selection. The bootstrap is used in several contexts, most commonly 
to provide a measure of accuracy of a parameter estimate or of a given 
statistical learning method. 


G. James et al., An Introduction to Statistical Learning: with Applications in R,
Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_5

5.1 Cross-Validation


In Chapter 2 we discuss the distinction between the test error rate and the 
training error rate. The test error is the average error that results from using 
a statistical learning method to predict the response on a new observation— 
that is, a measurement that was not used in training the method. Given 
a data set, the use of a particular statistical learning method is warranted 
if it results in a low test error. The test error can be easily calculated if a 
designated test set is available. Unfortunately, this is usually not the case. 
In contrast, the training error can be easily calculated by applying the 
statistical learning method to the observations used in its training. But as 
we saw in Chapter 2, the training error rate often is quite different from the 
test error rate, and in particular the former can dramatically underestimate 
the latter. 

In the absence of a very large designated test set that can be used to 
directly estimate the test error rate, a number of techniques can be used 
to estimate this quantity using the available training data. Some methods 
make a mathematical adjustment to the training error rate in order to 
estimate the test error rate. Such approaches are discussed in Chapter 6. 
In this section, we instead consider a class of methods that estimate the 
test error rate by holding out a subset of the training observations from the 
fitting process, and then applying the statistical learning method to those 
held out observations. 

In Sections 5.1.1-5.1.4, for simplicity we assume that we are interested 
in performing regression with a quantitative response. In Section 5.1.5 we 
consider the case of classification with a qualitative response. As we will 
see, the key concepts remain the same regardless of whether the response 
is quantitative or qualitative. 

5.1.1 The Validation Set Approach 


Suppose that we would like to estimate the test error associated with fitting
a particular statistical learning method on a set of observations. The
validation set approach, displayed in Figure 5.1, is a very simple strategy
for this task. It involves randomly dividing the available set of observations
into two parts, a training set and a validation set or hold-out set. The
model is fit on the training set, and the fitted model is used to predict the 
responses for the observations in the validation set. The resulting validation 
set error rate—typically assessed using MSE in the case of a quantitative 
response—provides an estimate of the test error rate. 
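
The procedure can be sketched in a few lines of R. This is only an illustration under stated assumptions: it uses the built-in mtcars data (with hp playing the role of horsepower) rather than the Auto data, and the seed and the half/half split are arbitrary.

```r
# Minimal sketch of the validation set approach.
# Assumption: mtcars (built into R) stands in for the data set.
set.seed(1)
n <- nrow(mtcars)
train <- sample(n, n %/% 2)   # row indices of the training set

# Fit on the training half only
fit <- lm(mpg ~ hp, data = mtcars, subset = train)

# Predict the held-out half and compute the validation set MSE
pred <- predict(fit, newdata = mtcars[-train, ])
val_mse <- mean((mtcars$mpg[-train] - pred)^2)
val_mse
```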

FIGURE 5.1. A schematic display of the validation set approach. A set of n
observations are randomly split into a training set (shown in blue, containing
observations 7, 22, and 13, among others) and a validation set (shown in beige,
and containing observation 91, among others). The statistical learning method is
fit on the training set, and its performance is evaluated on the validation set.

We illustrate the validation set approach on the Auto data set. Recall from
Chapter 3 that there appears to be a non-linear relationship between mpg
and horsepower, and that a model that predicts mpg using horsepower and
horsepower^2 gives better results than a model that uses only a linear term.
It is natural to wonder whether a cubic or higher-order fit might provide
even better results. We answer this question in Chapter 3 by looking at
the p-values associated with a cubic term and higher-order polynomial 
terms in a linear regression. But we could also answer this question using 
the validation method. We randomly split the 392 observations into two 
sets, a training set containing 196 of the data points, and a validation set 
containing the remaining 196 observations. The validation set error rates 
that result from fitting various regression models on the training sample 
and evaluating their performance on the validation sample, using MSE 
as a measure of validation set error, are shown in the left-hand panel of 
Figure 5.2. The validation set MSE for the quadratic fit is considerably 
smaller than for the linear fit. However, the validation set MSE for the cubic 
fit is actually slightly larger than for the quadratic fit. This implies that 
including a cubic term in the regression does not lead to better prediction 
than simply using a quadratic term. 

Recall that in order to create the left-hand panel of Figure 5.2, we randomly
divided the data set into two parts, a training set and a validation
set. If we repeat the process of randomly splitting the sample set into two
parts, we will get a somewhat different estimate for the test MSE. As an
illustration, the right-hand panel of Figure 5.2 displays ten different validation
set MSE curves from the Auto data set, produced using ten different
random splits of the observations into training and validation sets. All ten
curves indicate that the model with a quadratic term has a dramatically
smaller validation set MSE than the model with only a linear term. Furthermore,
all ten curves indicate that there is not much benefit in including
cubic or higher-order polynomial terms in the model. But it is worth noting 
that each of the ten curves results in a different test MSE estimate for each 
of the ten regression models considered. And there is no consensus among 
the curves as to which model results in the smallest validation set MSE. 
Based on the variability among these curves, all that we can conclude with 
any confidence is that the linear fit is not adequate for this data. 
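
The split-to-split variability can be reproduced in miniature with the R sketch below. It is a hedged illustration: mtcars stands in for Auto, polynomial degrees 1 to 4 are an arbitrary range, and val_mse() is a hypothetical helper written for this example.

```r
# Minimal sketch: validation set MSE for several polynomial degrees,
# repeated over ten random splits.
# Assumptions: mtcars stands in for Auto; val_mse() is a hypothetical helper.
set.seed(1)
n <- nrow(mtcars)
degrees <- 1:4

val_mse <- function(degree, train) {
  fit <- lm(mpg ~ poly(hp, degree), data = mtcars[train, ])
  pred <- predict(fit, newdata = mtcars[-train, ])
  mean((mtcars$mpg[-train] - pred)^2)
}

# One column per random split, one row per polynomial degree
curves <- sapply(1:10, function(split) {
  train <- sample(n, n %/% 2)
  sapply(degrees, val_mse, train = train)
})
dimnames(curves) <- list(degree = degrees, split = 1:10)

# Smallest and largest estimate for each degree across the ten splits
apply(curves, 1, range)
```

Each column of curves corresponds to one curve of the kind drawn in the right-hand panel of Figure 5.2.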

The validation set approach is conceptually simple and is easy to implement.
But it has two potential drawbacks:

FIGURE 5.2. The validation set approach was used on the Auto data set in
order to estimate the test error that results from predicting mpg using polynomial 
functions of horsepower. Left: Validation error estimates for a single split into 
training and validation data sets. Right: The validation method was repeated ten 
times, each time using a different random split of the observations into a training 
set and a validation set. This illustrates the variability in the estimated test MSE 
that results from this approach. 


1. As is shown in the right-hand panel of Figure 5.2, the validation estimate
of the test error rate can be highly variable, depending on precisely
which observations are included in the training set and which
observations are included in the validation set.

2. In the validation approach, only a subset of the observations—those
that are included in the training set rather than in the validation
set—are used to fit the model. Since statistical methods tend to perform
worse when trained on fewer observations, this suggests that the
validation set error rate may tend to overestimate the test error rate 
for the model fit on the entire data set. 

In the coming subsections, we will present cross-validation, a refinement of
the validation set approach that addresses these two issues. 

5.1.2 Leave-One-Out Cross-Validation 

Leave-one-out cross-validation (LOOCV) is closely related to the validation 
set approach of Section 5.1.1, but it attempts to address that method’s 
drawbacks. 

Like the validation set approach, LOOCV involves splitting the set of
observations into two parts. However, instead of creating two subsets of
comparable size, a single observation $(x_1, y_1)$ is used for the validation
set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the
training set. The statistical learning method is fit on the $n - 1$ training
observations, and a prediction $\hat{y}_1$ is made for the excluded observation,
using its value $x_1$. Since $(x_1, y_1)$ was not used in the fitting process,
$\mathrm{MSE}_1 = (y_1 - \hat{y}_1)^2$ provides an approximately unbiased estimate for the test error.
But even though $\mathrm{MSE}_1$ is unbiased for the test error, it is a poor estimate
because it is highly variable, since it is based upon a single observation
$(x_1, y_1)$.

FIGURE 5.3. A schematic display of LOOCV. A set of n data points is repeatedly
split into a training set (shown in blue) containing all but one observation,
and a validation set that contains only that observation (shown in beige). The test
error is then estimated by averaging the n resulting MSE’s. The first training set
contains all but observation 1, the second training set contains all but observation
2, and so forth.

We can repeat the procedure by selecting $(x_2, y_2)$ for the validation
data, training the statistical learning procedure on the $n - 1$ observations
$\{(x_1, y_1), (x_3, y_3), \ldots, (x_n, y_n)\}$, and computing $\mathrm{MSE}_2 = (y_2 - \hat{y}_2)^2$. Repeating
this approach n times produces n squared errors, $\mathrm{MSE}_1, \ldots, \mathrm{MSE}_n$.
The LOOCV estimate for the test MSE is the average of these n test error
estimates:

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i. \tag{5.1}$$

A schematic of the LOOCV approach is illustrated in Figure 5.3. 
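
Equation (5.1) translates directly into a loop. The R sketch below is a minimal illustration assuming the built-in mtcars data and a simple linear model of mpg on hp, neither of which is from the text.

```r
# Minimal sketch of LOOCV as in (5.1): leave each observation out in turn,
# fit on the remaining n - 1, and record the squared prediction error.
# Assumption: mtcars (built into R) stands in for the data set.
n <- nrow(mtcars)
mse_i <- numeric(n)

for (i in seq_len(n)) {
  fit <- lm(mpg ~ hp, data = mtcars[-i, ])          # train without row i
  pred_i <- predict(fit, newdata = mtcars[i, ])     # predict the held-out row
  mse_i[i] <- (mtcars$mpg[i] - pred_i)^2            # MSE_i
}

cv_n <- mean(mse_i)   # LOOCV estimate of the test MSE
cv_n
```

No random seed is needed here, since the n splits are fully determined by the data.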

LOOCV has a couple of major advantages over the validation set approach.
First, it has far less bias. In LOOCV, we repeatedly fit the statistical
learning method using training sets that contain $n - 1$ observations,
almost as many as are in the entire data set. This is in contrast to
the validation set approach, in which the training set is typically around
half the size of the original data set. Consequently, the LOOCV approach
tends not to overestimate the test error rate as much as the validation
set approach does. Second, in contrast to the validation approach which
will yield different results when applied repeatedly due to randomness in
the training/validation set splits, performing LOOCV multiple times will
always yield the same results: there is no randomness in the training/validation
set splits.

FIGURE 5.4. Cross-validation was used on the Auto data set in order to estimate
the test error that results from predicting mpg using polynomial functions
of horsepower. Left: The LOOCV error curve. Right: 10-fold CV was run nine
separate times, each with a different random split of the data into ten parts. The
figure shows the nine slightly different CV error curves.

We used LOOCV on the Auto data set in order to obtain an estimate 
of the test set MSE that results from fitting a linear regression model to 
predict mpg using polynomial functions of horsepower. The results are shown 
in the left-hand panel of Figure 5.4. 

LOOCV has the potential to be expensive to implement, since the model 
has to be fit n times. This can be very time consuming if n is large, and if 
each individual model is slow to fit. With least squares linear or polynomial 
regression, an amazing shortcut makes the cost of LOOCV the same as that 
of a single model fit! The following formula holds:

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2, \tag{5.2}$$

where $\hat{y}_i$ is the ith fitted value from the original least squares fit, and $h_i$ is
the leverage defined in (3.37) on page 98. This is like the ordinary MSE,
except the ith residual is divided by $1 - h_i$. The leverage lies between 1/n
and 1, and reflects the amount that an observation influences its own fit.
Hence the residuals for high-leverage points are inflated in this formula by 
exactly the right amount for this equality to hold. 
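
The identity in (5.2) is easy to check numerically. The R sketch below compares the shortcut with the brute-force loop; it assumes the built-in mtcars data and uses hatvalues() from base R's stats package to obtain the leverages.

```r
# Minimal sketch: check the LOOCV shortcut (5.2) against brute-force refitting.
# Assumption: mtcars (built into R) stands in for the data set.
fit <- lm(mpg ~ hp, data = mtcars)
h <- hatvalues(fit)                                # leverages h_i
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)  # right-hand side of (5.2)

# Brute force: refit n times, leaving one observation out each time
n <- nrow(mtcars)
cv_brute <- mean(sapply(seq_len(n), function(i) {
  fit_i <- lm(mpg ~ hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))

all.equal(cv_shortcut, cv_brute)
```

The shortcut needs only the single fit on the full data, whereas the loop refits the model n times.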

LOOCV is a very general method, and can be used with any kind of
predictive modeling. For example we could use it with logistic regression
or linear discriminant analysis, or any of the methods discussed in later
chapters. The magic formula (5.2) does not hold in general, in which case
the model has to be refit n times.

FIGURE 5.5. A schematic display of 5-fold CV. A set of n observations is
randomly split into five non-overlapping groups. Each of these fifths acts as a
validation set (shown in beige), and the remainder as a training set (shown in
blue). The test error is estimated by averaging the five resulting MSE estimates.

5.1.3 k-Fold Cross-Validation 

An alternative to LOOCV is k-fold CV. This approach involves randomly
dividing the set of observations into k groups, or folds, of approximately
equal size. The first fold is treated as a validation set, and the method
is fit on the remaining $k - 1$ folds. The mean squared error, $\mathrm{MSE}_1$, is
then computed on the observations in the held-out fold. This procedure is
repeated k times; each time, a different group of observations is treated
as a validation set. This process results in k estimates of the test error,
$\mathrm{MSE}_1, \mathrm{MSE}_2, \ldots, \mathrm{MSE}_k$. The k-fold CV estimate is computed by averaging
these values,

$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i. \tag{5.3}$$

Figure 5.5 illustrates the k-fold CV approach.
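
A minimal R sketch of (5.3) follows. It assumes the built-in mtcars data, k = 5, and an arbitrary seed; the fold labels are assigned by shuffling a repeated 1..k sequence, which is one simple way (among several) to get folds of approximately equal size.

```r
# Minimal sketch of k-fold CV as in (5.3).
# Assumptions: mtcars stands in for the data set; k = 5 and the seed are arbitrary.
set.seed(1)
k <- 5
n <- nrow(mtcars)

# Randomly assign each observation to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = n))

mse <- sapply(seq_len(k), function(j) {
  held_out <- fold == j
  fit <- lm(mpg ~ hp, data = mtcars[!held_out, ])      # fit on the other k - 1 folds
  pred <- predict(fit, newdata = mtcars[held_out, ])   # predict the held-out fold
  mean((mtcars$mpg[held_out] - pred)^2)                # MSE_j
})

cv_k <- mean(mse)   # k-fold CV estimate
cv_k
```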

It is not hard to see that LOOCV is a special case of k-fold CV in which k
is set to equal n. In practice, one typically performs k-fold CV using k = 5
or k = 10. What is the advantage of using k = 5 or k = 10 rather than
k = n? The most obvious advantage is computational. LOOCV requires
fitting the statistical learning method n times. This has the potential to be
computationally expensive (except for linear models fit by least squares,
in which case formula (5.2) can be used). But cross-validation is a very
general approach that can be applied to almost any statistical learning
method. Some statistical learning methods have computationally intensive
fitting procedures, and so performing LOOCV may pose computational
problems, especially if n is extremely large. In contrast, performing 10-fold
CV requires fitting the learning procedure only ten times, which may be
much more feasible. As we see in Section 5.1.4, there also can be other
non-computational advantages to performing 5-fold or 10-fold CV, which
involve the bias-variance trade-off.

FIGURE 5.6. True and estimated test MSE for the simulated data sets in Figures
2.9 (left), 2.10 (center), and 2.11 (right). The true test MSE is shown in
blue, the LOOCV estimate is shown as a black dashed line, and the 10-fold CV
estimate is shown in orange. The crosses indicate the minimum of each of the
MSE curves.

The right-hand panel of Figure 5.4 displays nine different 10-fold CV 
estimates for the Auto data set, each resulting from a different random 
split of the observations into ten folds. As we can see from the figure, there 
is some variability in the CV estimates as a result of the variability in how 
the observations are divided into ten folds. But this variability is typically 
much lower than the variability in the test error estimates that results from 
the validation set approach (right-hand panel of Figure 5.2). 

When we examine real data, we do not know the true test MSE, and 
so it is difficult to determine the accuracy of the cross-validation estimate. 
However, if we examine simulated data, then we can compute the true 
test MSE, and can thereby evaluate the accuracy of our cross-validation 
results. In Figure 5.6, we plot the cross-validation estimates and true test 
error rates that result from applying smoothing splines to the simulated 
data sets illustrated in Figures 2.9-2.11 of Chapter 2. The true test MSE 
is displayed in blue. The black dashed and orange solid lines respectively 
show the estimated LOOCV and 10-fold CV estimates. In all three plots, 
the two cross-validation estimates are very similar. In the right-hand panel 
of Figure 5.6, the true test MSE and the cross-validation curves are almost 
identical. In the center panel of Figure 5.6, the two sets of curves are similar 
at the lower degrees of flexibility, while the CV curves overestimate the test 
set MSE for higher degrees of flexibility. In the left-hand panel of Figure 5.6, 
the CV curves have the correct general shape, but they underestimate the 
true test MSE.

When we perform cross-validation, our goal might be to determine how
well a given statistical learning procedure can be expected to perform on 
independent data; in this case, the actual estimate of the test MSE is 
of interest. But at other times we are interested only in the location of 
the minimum point in the estimated test MSE curve. This is because we 
might be performing cross-validation on a number of statistical learning 
methods, or on a single method using different levels of flexibility, in order 
to identify the method that results in the lowest test error. For this purpose, 
the location of the minimum point in the estimated test MSE curve is 
important, but the actual value of the estimated test MSE is not. We find 
in Figure 5.6 that despite the fact that they sometimes underestimate the 
true test MSE, all of the CV curves come close to identifying the correct 
level of flexibility—that is, the flexibility level corresponding to the smallest 
test MSE. 


5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation 

We mentioned in Section 5.1.3 that k-fold CV with k < n has a computational
advantage over LOOCV. But putting computational issues aside,
a less obvious but potentially more important advantage of k-fold CV is
that it often gives more accurate estimates of the test error rate than does 
LOOCV. This has to do with a bias-variance trade-off. 

It was mentioned in Section 5.1.1 that the validation set approach can 
lead to overestimates of the test error rate, since in this approach the 
training set used to fit the statistical learning method contains only half 
the observations of the entire data set. Using this logic, it is not hard to 
see that LOOCV will give approximately unbiased estimates of the test 
error, since each training set contains $n - 1$ observations, which is almost
as many as the number of observations in the full data set. And performing
k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of
bias, since each training set contains $(k - 1)n/k$ observations—fewer than
in the LOOCV approach, but substantially more than in the validation set
approach. Therefore, from the perspective of bias reduction, it is clear that
LOOCV is to be preferred to k-fold CV.

However, we know that bias is not the only source for concern in an estimating
procedure; we must also consider the procedure’s variance. It turns
out that LOOCV has higher variance than does k-fold CV with k < n. Why
is this the case? When we perform LOOCV, we are in effect averaging the
outputs of n fitted models, each of which is trained on an almost identical
set of observations; therefore, these outputs are highly (positively) correlated
with each other. In contrast, when we perform k-fold CV with k < n,
we are averaging the outputs of k fitted models that are somewhat less
correlated with each other, since the overlap between the training sets in
each model is smaller. Since the mean of many highly correlated quantities
has higher variance than does the mean of many quantities that are not
as highly correlated, the test error estimate resulting from LOOCV tends
to have higher variance than does the test error estimate resulting from
k-fold CV.
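
The effect of correlation on the variance of a mean can be made concrete with a standard identity. The setting below is an idealization introduced only for illustration, not something the text assumes: m quantities $Z_1, \ldots, Z_m$ that share a common variance $\sigma^2$ and a common pairwise correlation $\rho$.

```latex
% Assumption (not from the text): Z_1, ..., Z_m share variance \sigma^2
% and a common pairwise correlation \rho.
\[
\operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} Z_i\right)
  = \frac{1}{m^2}\left[m\sigma^2 + m(m-1)\rho\sigma^2\right]
  = \frac{\sigma^2}{m} + \frac{m-1}{m}\,\rho\,\sigma^2 .
\]
```

With $\rho = 0$ the variance shrinks like $\sigma^2/m$, but for $\rho > 0$ the second term does not vanish as m grows, so averaging highly correlated quantities reduces variance much less than averaging weakly correlated ones.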

To summarize, there is a bias-variance trade-off associated with the 
choice of k in k-fold cross-validation. Typically, given these considerations,
one performs k-fold cross-validation using k = 5 or k = 10, as these values
have been shown empirically to yield test error rate estimates that suffer 
neither from excessively high bias nor from very high variance. 

5.1.5 Cross-Validation on Classification Problems 

In this chapter so far, we have illustrated the use of cross-validation in the 
regression setting where the outcome Y is quantitative, and so have used 
MSE to quantify test error. But cross-validation can also be a very useful 
approach in the classification setting when Y is qualitative. In this setting, 
cross-validation works just as described earlier in this chapter, except that 
rather than using MSE to quantify test error, we instead use the number 
of misclassified observations. For instance, in the classification setting, the 
LOOCV error rate takes the form

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i, \tag{5.4}$$

where $\mathrm{Err}_i = I(y_i \neq \hat{y}_i)$. The k-fold CV error rate and validation set error
rates are defined analogously. 
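
For a classifier, the LOOCV error rate in (5.4) is simply the fraction of held-out observations that are misclassified. The R sketch below illustrates this with logistic regression on simulated data; the data-generating model, the sample size, and the 0.5 probability threshold are all assumptions made for this example.

```r
# Minimal sketch of the LOOCV error rate (5.4) for a classifier.
# Assumptions: simulated two-predictor data; 0.5 threshold on the fitted probability.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(x1 - x2))
dat <- data.frame(y, x1, x2)

err <- sapply(seq_len(n), function(i) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = dat[-i, ])
  p_hat <- predict(fit, newdata = dat[i, ], type = "response")
  y_hat <- as.integer(p_hat > 0.5)   # predicted class for the held-out point
  as.integer(y_hat != dat$y[i])      # Err_i = I(y_i != y_hat_i)
})

cv_n <- mean(err)   # LOOCV misclassification rate
cv_n
```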

As an example, we fit various logistic regression models on the two- 
dimensional classification data displayed in Figure 2.13. In the top-left 
panel of Figure 5.7, the black solid line shows the estimated decision boundary
resulting from fitting a standard logistic regression model to this data
set. Since this is simulated data, we can compute the true test error rate,
which takes a value of 0.201 and so is substantially larger than the Bayes
error rate of 0.133. Clearly logistic regression does not have enough flexibility
to model the Bayes decision boundary in this setting. We can easily
extend logistic regression to obtain a non-linear decision boundary by using 
polynomial functions of the predictors, as we did in the regression setting in 
Section 3.3.2. For example, we can fit a quadratic logistic regression model, 
given by

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2. \tag{5.5}$$

The top-right panel of Figure 5.7 displays the resulting decision boundary,
which is now curved. However, the test error rate has improved only slightly,
to 0.197. A much larger improvement is apparent in the bottom-left panel
of Figure 5.7, in which we have fit a logistic regression model involving
cubic polynomials of the predictors. Now the test error rate has decreased
to 0.160. Going to a quartic polynomial (bottom-right) slightly increases
the test error.

FIGURE 5.7. Logistic regression fits on the two-dimensional classification data
displayed in Figure 2.13. The Bayes decision boundary is represented using a
purple dashed line. Estimated decision boundaries from linear, quadratic, cubic
and quartic (degrees 1-4) logistic regressions are displayed in black. The test error
rates for the four logistic regression fits are respectively 0.201, 0.197, 0.160, and
0.162, while the Bayes error rate is 0.133.

In practice, for real data, the Bayes decision boundary and the test error
rates are unknown. So how might we decide between the four logistic
regression models displayed in Figure 5.7? We can use cross-validation in
order to make this decision. The left-hand panel of Figure 5.8 displays in
black the 10-fold CV error rates that result from fitting ten logistic regression
models to the data, using polynomial functions of the predictors up
to tenth order. The true test errors are shown in brown, and the training
errors are shown in blue. As we have seen previously, the training error
tends to decrease as the flexibility of the fit increases. (The figure indicates
that though the training error rate doesn’t quite decrease monotonically,
it tends to decrease on the whole as the model complexity increases.) In
contrast, the test error displays a characteristic U-shape. The 10-fold CV
error rate provides a pretty good approximation to the test error rate.
While it somewhat underestimates the error rate, it reaches a minimum
when fourth-order polynomials are used, which is very close to the minimum
of the test curve, which occurs when third-order polynomials are
used. In fact, using fourth-order polynomials would likely lead to good test
set performance, as the true test error rate is approximately the same for
third, fourth, fifth, and sixth-order polynomials.

FIGURE 5.8. Test error (brown), training error (blue), and 10-fold CV error
(black) on the two-dimensional classification data displayed in Figure 5.7. Left:
Logistic regression using polynomial functions of the predictors. The order of
the polynomials used is displayed on the x-axis. Right: The KNN classifier with
different values of K, the number of neighbors used in the KNN classifier.

The right-hand panel of Figure 5.8 displays the same three curves using
the KNN approach for classification, as a function of the value of K
(which in this context indicates the number of neighbors used in the KNN 
classifier, rather than the number of CV folds used). Again the training 
error rate declines as the method becomes more flexible, and so we see that 
the training error rate cannot be used to select the optimal value for K. 
Though the cross-validation error curve slightly underestimates the test 
error rate, it takes on a minimum very close to the best value for K. 
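The selection of a polynomial degree by 10-fold CV described above for the logistic regression fits can be sketched with cv.glm() from the boot library, which accepts a cost argument. This is only an illustration under stated assumptions: the data are simulated with a hypothetical curved boundary, `misclass()` is a helper written for this sketch, and only degrees one to four are tried.

```r
library(boot)  # provides cv.glm()

set.seed(1)
n  <- 200
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
# Hypothetical non-linear class boundary; not the data of Figure 2.13
y  <- rbinom(n, 1, plogis(2 * (x2 - x1^2 + 1)))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Cost function: fraction of observations misclassified at a 0.5 threshold.
# cv.glm() calls it with the observed responses and predicted probabilities.
misclass <- function(obs, prob) mean(abs(obs - prob) > 0.5)

degrees <- 1:4
cv_err  <- numeric(length(degrees))
for (d in degrees) {
  fit <- glm(y ~ poly(x1, d) + poly(x2, d), family = binomial, data = dat)
  # delta[1] is the raw K-fold CV estimate of the misclassification rate
  cv_err[d] <- cv.glm(dat, fit, cost = misclass, K = 10)$delta[1]
}

cv_err             # one CV error rate per polynomial degree
which.min(cv_err)  # degree with the smallest CV error
```

Because the folds are drawn at random, a different seed can change which degree attains the minimum when several degrees have similar error rates.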

5.2 The Bootstrap


The bootstrap is a widely applicable and extremely powerful statistical tool 
that can be used to quantify the uncertainty associated with a given estimator
or statistical learning method. As a simple example, the bootstrap
can be used to estimate the standard errors of the coefficients from a linear 
regression fit. In the specific case of linear regression, this is not particularly 
useful, since we saw in Chapter 3 that standard statistical software such as 
R outputs such standard errors automatically. However, the power of the 
bootstrap lies in the fact that it can be easily applied to a wide range of 
statistical learning methods, including some for which a measure of variability
is otherwise difficult to obtain and is not automatically output by
statistical software. 

In this section we illustrate the bootstrap on a toy example in which we 
wish to determine the best investment allocation under a simple model. 
In Section 5.3 we explore the use of the bootstrap to assess the variability 
associated with the regression coefficients in a linear model fit. 

Suppose that we wish to invest a fixed sum of money in two financial
assets that yield returns of $X$ and $Y$, respectively, where $X$ and $Y$ are
random quantities. We will invest a fraction $\alpha$ of our money in $X$, and will
invest the remaining $1 - \alpha$ in $Y$. Since there is variability associated with
the returns on these two assets, we wish to choose $\alpha$ to minimize the total
risk, or variance, of our investment. In other words, we want to minimize
$\mathrm{Var}(\alpha X + (1 - \alpha) Y)$. One can show that the value that minimizes the risk
is given by

$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}, \tag{5.6}$$

where $\sigma_X^2 = \mathrm{Var}(X)$, $\sigma_Y^2 = \mathrm{Var}(Y)$, and $\sigma_{XY} = \mathrm{Cov}(X, Y)$.

In reality, the quantities $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$ are unknown. We can compute
estimates for these quantities, $\hat{\sigma}_X^2$, $\hat{\sigma}_Y^2$, and $\hat{\sigma}_{XY}$, using a data set that
contains past measurements for $X$ and $Y$. We can then estimate the value
of $\alpha$ that minimizes the variance of our investment using

$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}. \tag{5.7}$$

Figure 5.9 illustrates this approach for estimating $\alpha$ on a simulated data
set. In each panel, we simulated 100 pairs of returns for the investments
$X$ and $Y$. We used these returns to estimate $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$, which we
then substituted into (5.7) in order to obtain estimates for $\alpha$. The value of
$\hat{\alpha}$ resulting from each simulated data set ranges from 0.532 to 0.657.

FIGURE 5.9. Each panel displays 100 simulated returns for investments
$X$ and $Y$. From left to right and top to bottom, the resulting estimates for $\alpha$
are 0.576, 0.532, 0.657, and 0.651.

It is natural to wish to quantify the accuracy of our estimate of $\alpha$. To
estimate the standard deviation of $\hat{\alpha}$, we repeated the process of simulating
100 paired observations of $X$ and $Y$, and estimating $\alpha$ using (5.7),
1,000 times. We thereby obtained 1,000 estimates for $\alpha$, which we can call
$\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{1,000}$. The left-hand panel of Figure 5.10 displays a histogram
of the resulting estimates. For these simulations the parameters were set to
$\sigma_X^2 = 1$, $\sigma_Y^2 = 1.25$, and $\sigma_{XY} = 0.5$, and so we know that the true value of
$\alpha$ is 0.6. We indicated this value using a solid vertical line on the histogram.
The mean over all 1,000 estimates for $\alpha$ is

$$\bar{\alpha} = \frac{1}{1{,}000} \sum_{r=1}^{1{,}000} \hat{\alpha}_r = 0.5996,$$

very close to $\alpha = 0.6$, and the standard deviation of the estimates is

$$\sqrt{\frac{1}{1{,}000 - 1} \sum_{r=1}^{1{,}000} \left(\hat{\alpha}_r - \bar{\alpha}\right)^2} = 0.083.$$

This gives us a very good idea of the accuracy of $\hat{\alpha}$: $\mathrm{SE}(\hat{\alpha}) \approx 0.083$. So
roughly speaking, for a random sample from the population, we would
expect $\hat{\alpha}$ to differ from $\alpha$ by approximately 0.08, on average.
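The repeated-simulation experiment described above can be sketched in a few lines of R. The text gives the variances and covariance but not the distribution of the returns, so this sketch assumes bivariate normal returns; the names `alpha_hat()` and `simulate_pairs()` are hypothetical, and the numbers it produces will not match the book's 0.5996 and 0.083 exactly.

```r
set.seed(1)

# Population parameters quoted in the text
var_x  <- 1
var_y  <- 1.25
cov_xy <- 0.5

# True alpha from (5.6): (1.25 - 0.5) / (1 + 1.25 - 2 * 0.5) = 0.75 / 1.25 = 0.6
alpha_true <- (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Plug-in estimate (5.7) from a sample of paired returns
alpha_hat <- function(x, y) {
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
}

# Draw n pairs with the required variances and covariance.
# Assumption: bivariate normal returns (the text does not state the distribution).
simulate_pairs <- function(n) {
  z1 <- rnorm(n)
  z2 <- rnorm(n)
  x <- sqrt(var_x) * z1
  y <- (cov_xy / sqrt(var_x)) * z1 + sqrt(var_y - cov_xy^2 / var_x) * z2
  list(x = x, y = y)
}

# 1,000 independent data sets of 100 pairs each, one estimate per data set
estimates <- replicate(1000, {
  s <- simulate_pairs(100)
  alpha_hat(s$x, s$y)
})

mean(estimates)  # average of the 1,000 estimates
sd(estimates)    # simulation-based standard error of the estimate
```

This is the idealized procedure that requires access to the population; it is useful here only as a benchmark against which a bootstrap estimate can be compared.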

In practice, however, the procedure for estimating $\mathrm{SE}(\hat{\alpha})$ outlined above
cannot be applied, because for real data we cannot generate new samples
from the original population. However, the bootstrap approach allows us
to use a computer to emulate the process of obtaining new sample sets,
so that we can estimate the variability of $\hat{\alpha}$ without generating additional
samples. Rather than repeatedly obtaining independent data sets from the
population, we instead obtain distinct data sets by repeatedly sampling
observations from the original data set.

FIGURE 5.10. Left: A histogram of the estimates of $\alpha$ obtained by generating
1,000 simulated data sets from the true population. Center: A histogram of the
estimates of $\alpha$ obtained from 1,000 bootstrap samples from a single data set.
Right: The estimates of $\alpha$ displayed in the left and center panels are shown as
boxplots. In each panel, the pink line indicates the true value of $\alpha$.

This approach is illustrated in Figure 5.11 on a simple data set, which
we call $Z$, that contains only $n = 3$ observations. We randomly select $n$
observations from the data set in order to produce a bootstrap data set,
$Z^{*1}$. The sampling is performed with replacement, which means that the
same observation can occur more than once in the bootstrap data set. In
this example, $Z^{*1}$ contains the third observation twice, the first observation
once, and no instances of the second observation. Note that if an observation
is contained in $Z^{*1}$, then both its $X$ and $Y$ values are included. We can use
$Z^{*1}$ to produce a new bootstrap estimate for $\alpha$, which we call $\hat{\alpha}^{*1}$. This
procedure is repeated $B$ times for some large value of $B$, in order to produce
$B$ different bootstrap data sets, $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and $B$ corresponding $\alpha$
estimates, $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$. We can compute the standard error of these
bootstrap estimates using the formula

$$\mathrm{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \right)^2}. \tag{5.8}$$

This serves as an estimate of the standard error of $\hat{\alpha}$ estimated from the
original data set.
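The resampling procedure and formula (5.8) can be written out by hand in base R. The following is a sketch on stand-in data: the 100 simulated pairs and the names `z`, `alpha_hat()`, and `alpha_star` are hypothetical and are not the data used for Figure 5.10.

```r
set.seed(1)

# Stand-in data: 100 simulated (X, Y) pairs (hypothetical; not the book's data)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
z <- data.frame(X = x, Y = y)

# Plug-in estimate (5.7) computed from a data frame with columns X and Y
alpha_hat <- function(d) {
  (var(d$Y) - cov(d$X, d$Y)) / (var(d$X) + var(d$Y) - 2 * cov(d$X, d$Y))
}

B <- 1000
alpha_star <- numeric(B)
for (r in seq_len(B)) {
  # Row indices of the rth bootstrap data set, drawn with replacement
  idx <- sample(n, n, replace = TRUE)
  # Whole rows are resampled, so each X stays paired with its Y
  alpha_star[r] <- alpha_hat(z[idx, ])
}

# (5.8): sample standard deviation of the B bootstrap estimates
se_boot <- sqrt(sum((alpha_star - mean(alpha_star))^2) / (B - 1))
se_boot
```

Since (5.8) is the usual sample standard deviation with divisor B - 1, the last step gives the same value as `sd(alpha_star)`.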

FIGURE 5.11. A graphical illustration of the bootstrap approach on a small
sample containing $n = 3$ observations. Each bootstrap data set contains $n$ observations,
sampled with replacement from the original data set. Each bootstrap data
set is used to obtain an estimate of $\alpha$.

The bootstrap approach is illustrated in the center panel of Figure 5.10,
which displays a histogram of 1,000 bootstrap estimates of $\alpha$, each computed
using a distinct bootstrap data set. This panel was constructed on
the basis of a single data set, and hence could be created using real data.
Note that the histogram looks very similar to the left-hand panel, which displays
the idealized histogram of the estimates of $\alpha$ obtained by generating
1,000 simulated data sets from the true population. In particular the bootstrap
estimate $\mathrm{SE}(\hat{\alpha})$ from (5.8) is 0.087, very close to the estimate of 0.083
obtained using 1,000 simulated data sets. The right-hand panel displays the
information in the center and left panels in a different way, via boxplots of
the estimates for $\alpha$ obtained by generating 1,000 simulated data sets from
the true population and using the bootstrap approach. Again, the boxplots
are quite similar to each other, indicating that the bootstrap approach can
be used to effectively estimate the variability associated with $\hat{\alpha}$.


5.3 Lab: Cross-Validation and the Bootstrap 

In this lab, we explore the resampling techniques covered in this chapter. 
Some of the commands in this lab may take a while to run on your computer.

5.3.1 The Validation Set Approach

We explore the use of the validation set approach in order to estimate the 
test error rates that result from fitting various linear models on the Auto 
data set. 

Before we begin, we use the set.seed() function in order to set a seed for
R’s random number generator, so that the reader of this book will obtain
precisely the same results as those shown below. It is generally a good idea
to set a random seed when performing an analysis such as cross-validation
that contains an element of randomness, so that the results obtained can
be reproduced precisely at a later time.

We begin by using the sample() function to split the set of observations
into two halves, by selecting a random subset of 196 observations out of
the original 392 observations. We refer to these observations as the training
set.

> library(ISLR)
> set.seed(1)
> train = sample(392,196)

(Here we use a shortcut in the sample command; see ?sample for details.)
We then use the subset option in lm() to fit a linear regression using only
the observations corresponding to the training set.

> lm.fit = lm(mpg~horsepower,data=Auto,subset=train)

We now use the predict() function to estimate the response for all 392
observations, and we use the mean() function to calculate the MSE of the
196 observations in the validation set. Note that the -train index below
selects only the observations that are not in the training set.

> attach(Auto)
> mean((mpg-predict(lm.fit,Auto))[-train]^2)
[1] 26.14

Therefore, the estimated test MSE for the linear regression fit is 26.14. We
can use the poly() function to estimate the test error for the quadratic
and cubic regressions.

> lm.fit2 = lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> mean((mpg-predict(lm.fit2,Auto))[-train]^2)
[1] 19.82
> lm.fit3 = lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> mean((mpg-predict(lm.fit3,Auto))[-train]^2)
[1] 19.78

These error rates are 19.82 and 19.78, respectively. If we choose a different 
training set instead, then we will obtain somewhat different errors on the 
validation set. 

> set.seed(2)
> train = sample(392,196)
> lm.fit = lm(mpg~horsepower,subset=train)
> mean((mpg-predict(lm.fit,Auto))[-train]^2)
[1] 23.30
> lm.fit2 = lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> mean((mpg-predict(lm.fit2,Auto))[-train]^2)
[1] 18.90
> lm.fit3 = lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> mean((mpg-predict(lm.fit3,Auto))[-train]^2)
[1] 19.26

Using this split of the observations into a training set and a validation 
set, we find that the validation set error rates for the models with linear, 
quadratic, and cubic terms are 23.30, 18.90, and 19.26, respectively. 

These results are consistent with our previous findings: a model that 
predicts mpg using a quadratic function of horsepower performs better than 
a model that involves only a linear function of horsepower, and there is 
little evidence in favor of a model that uses a cubic function of horsepower. 
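The validation set steps above can also be wrapped in a small function, which avoids attach() and repeated copies of the same command. The sketch below is an illustration only: it uses the mtcars data set that ships with R as a stand-in for Auto (with hp in place of horsepower), and `val_mse()` is a hypothetical helper, so its numbers are unrelated to those reported in this lab.

```r
set.seed(1)

# mtcars ships with base R and stands in for Auto here
n     <- nrow(mtcars)       # 32 observations
train <- sample(n, n / 2)   # indices of the training half

# Validation set MSE for a polynomial fit of the given degree
val_mse <- function(degree) {
  fit  <- lm(mpg ~ poly(hp, degree), data = mtcars[train, ])
  pred <- predict(fit, newdata = mtcars[-train, ])
  mean((mtcars$mpg[-train] - pred)^2)
}

mse <- sapply(1:3, val_mse)
names(mse) <- c("linear", "quadratic", "cubic")
mse
```

Rerunning the block with a different seed changes the split and therefore the three estimates, which is the variability the validation set approach is known for.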

5.3.2 Leave-One-Out Cross-Validation 

The LOOCV estimate can be automatically computed for any generalized
linear model using the glm() and cv.glm() functions. In the lab for Chapter 4,
we used the glm() function to perform logistic regression by passing
in the family="binomial" argument. But if we use glm() to fit a model
without passing in the family argument, then it performs linear regression,
just like the lm() function. So for instance,

> glm.fit=glm(mpg~horsepower,data=Auto)
> coef(glm.fit)
(Intercept)  horsepower
     39.936      -0.158

and

> lm.fit=lm(mpg~horsepower,data=Auto)
> coef(lm.fit)
(Intercept)  horsepower
     39.936      -0.158

yield identical linear regression models. In this lab, we will perform linear
regression using the glm() function rather than the lm() function because
the former can be used together with cv.glm(). The cv.glm() function is
part of the boot library.

> library(boot)
> glm.fit=glm(mpg~horsepower,data=Auto)
> cv.err=cv.glm(Auto,glm.fit)
> cv.err$delta
    1     1
24.23 24.23

The cv.glm() function produces a list with several components. The two
numbers in the delta vector contain the cross-validation results. In this
case the numbers are identical (up to two decimal places) and correspond
to the LOOCV statistic given in (5.1). Below, we discuss a situation in
which the two numbers differ. Our cross-validation estimate for the test
error is approximately 24.23.

We can repeat this procedure for increasingly complex polynomial fits. 
To automate the process, we use the for() function to initiate a for loop 
which iteratively fits polynomial regressions for polynomials of order i = 1 
to i = 5, computes the associated cross-validation error, and stores it in 
the ith element of the vector cv.error. We begin by initializing the vector.
This command will likely take a couple of minutes to run.

> cv.error=rep(0,5)
> for (i in 1:5){
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
+ cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
+ }
> cv.error
[1] 24.23 19.25 19.33 19.42 19.03

As in Figure 5.4, we see a sharp drop in the estimated test MSE between 
the linear and quadratic fits, but then no clear improvement from using 
higher-order polynomials. 


5.3.3 k-Fold Cross-Validation 

The cv.glm() function can also be used to implement k-fold CV. Below we
use k = 10, a common choice for k, on the Auto data set. We once again set
a random seed and initialize a vector in which we will store the CV errors
corresponding to the polynomial fits of orders one to ten.

> set.seed(17)
> cv.error.10=rep(0,10)
> for (i in 1:10){
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
+ cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
+ }
> cv.error.10
[1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50

Notice that the computation time is much shorter than that of LOOCV. 
(In principle, the computation time for LOOCV for a least squares linear 
model should be faster than for k-fold CV, due to the availability of the
formula (5.2) for LOOCV; however, unfortunately the cv.glm() function
does not make use of this formula.) We still see little evidence that using 
cubic or higher-order polynomial terms leads to lower test error than simply 
using a quadratic fit. 
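The shortcut mentioned in the parenthetical remark above can be sketched with base R alone. For a least squares fit, (5.2) gives the LOOCV estimate from a single fit by dividing each residual by one minus its leverage. The helper names `loocv_shortcut()` and `loocv_brute()` are hypothetical, and mtcars (shipped with R) stands in for Auto so the block is self-contained.

```r
# LOOCV for a least squares fit via the leverage shortcut (5.2):
# the mean of ((y_i - yhat_i) / (1 - h_i))^2, computed from one fit
loocv_shortcut <- function(fit) {
  h <- hatvalues(fit)   # leverage h_i of each observation
  mean((residuals(fit) / (1 - h))^2)
}

# Brute-force LOOCV for comparison: refit the model n times
loocv_brute <- function(formula, data) {
  n <- nrow(data)
  response <- all.vars(formula)[1]
  sq_err <- numeric(n)
  for (i in seq_len(n)) {
    fit_i <- lm(formula, data = data[-i, ])
    pred  <- predict(fit_i, newdata = data[i, ])
    sq_err[i] <- (data[[response]][i] - pred)^2
  }
  mean(sq_err)
}

# mtcars stands in for Auto; hp plays the role of horsepower
fit      <- lm(mpg ~ hp + I(hp^2), data = mtcars)
shortcut <- loocv_shortcut(fit)
brute    <- loocv_brute(mpg ~ hp + I(hp^2), mtcars)
c(shortcut = shortcut, brute = brute)
```

The two values agree up to floating-point error, but the shortcut needs one model fit instead of n. It applies to least squares fits only, not to generalized linear models in general.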

We saw in Section 5.3.2 that the two numbers associated with delta are 
essentially the same when LOOCV is performed. When we instead perform 
k-fold CV, then the two numbers associated with delta differ slightly. The
first is the standard k-fold CV estimate, as in (5.3). The second is a bias-
corrected version. On this data set, the two estimates are very similar to 
each other. 


5.3.4 The Bootstrap 

We illustrate the use of the bootstrap in the simple example of Section 5.2, 
as well as on an example involving estimating the accuracy of the linear 
regression model on the Auto data set. 

Estimating the Accuracy of a Statistic of Interest 

One of the great advantages of the bootstrap approach is that it can be 
applied in almost all situations. No complicated mathematical calculations 
are required. Performing a bootstrap analysis in R entails only two steps. 
First, we must create a function that computes the statistic of interest. 
Second, we use the boot() function, which is part of the boot library, to
perform the bootstrap by repeatedly sampling observations from the data 
set with replacement. 

The Portfolio data set in the ISLR package is described in Section 5.2. 
To illustrate the use of the bootstrap on this data, we must first create 
a function, alpha.fn(), which takes as input the (X,Y) data as well as
a vector indicating which observations should be used to estimate $\alpha$. The
function then outputs the estimate for $\alpha$ based on the selected observations.

> alpha.fn=function(data,index){
+ X=data$X[index]
+ Y=data$Y[index]
+ return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
+ }

This function returns, or outputs, an estimate for $\alpha$ based on applying
(5.7) to the observations indexed by the argument index. For instance, the
following command tells R to estimate $\alpha$ using all 100 observations.

> alpha.fn(Portfolio,1:100)
[1] 0.576

The next command uses the sample() function to randomly select 100 observations
from the range 1 to 100, with replacement. This is equivalent
to constructing a new bootstrap data set and recomputing $\hat{\alpha}$ based on the
new data set.

> set.seed(1)
> alpha.fn(Portfolio,sample(100,100,replace=T))
[1] 0.596

We can implement a bootstrap analysis by performing this command many
times, recording all of the corresponding estimates for $\alpha$, and computing
the resulting standard deviation. However, the boot() function automates
this approach. Below we produce R = 1,000 bootstrap estimates for $\alpha$.

> boot(Portfolio,alpha.fn,R=1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)

Bootstrap Statistics :
    original        bias    std. error
t1*   0.5758  -7.315e-05        0.0886

The final output shows that using the original data, $\hat{\alpha} = 0.5758$, and that
the bootstrap estimate for $\mathrm{SE}(\hat{\alpha})$ is 0.0886.
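The three columns printed by boot() can be recovered from the object it returns, which is convenient when the standard error is needed in later computations. The sketch below uses simulated stand-in data (hypothetical, not the Portfolio data set), so its numbers differ from those above; it relies on the `t0` and `t` components of a boot object.

```r
library(boot)

set.seed(1)
# Stand-in for Portfolio: 100 simulated (X, Y) pairs (hypothetical data)
dat   <- data.frame(X = rnorm(100))
dat$Y <- 0.5 * dat$X + rnorm(100)

# Same statistic as alpha.fn() in the lab
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

b <- boot(dat, alpha.fn, R = 1000)

b$t0                    # statistic on the original data ("original" column)
mean(b$t[, 1]) - b$t0   # bootstrap mean minus original ("bias" column)
sd(b$t[, 1])            # standard deviation of the R replicates ("std. error")
```

Here `b$t` is a matrix with one row per bootstrap replicate and one column per statistic, so the same pattern extends to functions that return several coefficients.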

Estimating the Accuracy of a Linear Regression Model 

The bootstrap approach can be used to assess the variability of the coefficient
estimates and predictions from a statistical learning method. Here
we use the bootstrap approach in order to assess the variability of the
estimates for $\beta_0$ and $\beta_1$, the intercept and slope terms for the linear regression
model that uses horsepower to predict mpg in the Auto data set. We
will compare the estimates obtained using the bootstrap to those obtained
using the formulas for $\mathrm{SE}(\hat{\beta}_0)$ and $\mathrm{SE}(\hat{\beta}_1)$ described in Section 3.1.2.

We first create a simple function, boot.fn(), which takes in the Auto data
set as well as a set of indices for the observations, and returns the intercept
and slope estimates for the linear regression model. We then apply this
function to the full set of 392 observations in order to compute the estimates
of $\beta_0$ and $\beta_1$ on the entire data set using the usual linear regression
coefficient estimate formulas from Chapter 3. Note that we do not need the
{ and } at the beginning and end of the function because it is only one line
long.

> boot.fn=function(data,index)
+ return(coef(lm(mpg~horsepower,data=data,subset=index)))
> boot.fn(Auto,1:392)
(Intercept)  horsepower
     39.936      -0.158

The boot.fn() function can also be used in order to create bootstrap estimates
for the intercept and slope terms by randomly sampling from among
the observations with replacement. Here we give two examples.

> set.seed(1)
> boot.fn(Auto,sample(392,392,replace=T))
(Intercept)  horsepower
     38.739      -0.148
> boot.fn(Auto,sample(392,392,replace=T))
(Intercept)  horsepower
     40.038      -0.160

Next, we use the boot() function to compute the standard errors of 1,000
bootstrap estimates for the intercept and slope terms.

> boot(Auto,boot.fn,1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
    original     bias    std. error
t1*   39.936   0.0297        0.8600
t2*   -0.158  -0.0003        0.0074

This indicates that the bootstrap estimate for $\mathrm{SE}(\hat{\beta}_0)$ is 0.86, and that
the bootstrap estimate for $\mathrm{SE}(\hat{\beta}_1)$ is 0.0074. As discussed in Section 3.1.2,
standard formulas can be used to compute the standard errors for the
regression coefficients in a linear model. These can be obtained using the
summary() function.

> summary(lm(mpg~horsepower,data=Auto))$coef
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   39.936    0.71750    55.7 1.22e-187
horsepower    -0.158    0.00645   -24.5  7.03e-81

The standard error estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$ obtained using the formulas
from Section 3.1.2 are 0.717 for the intercept and 0.0064 for the slope.
Interestingly, these are somewhat different from the estimates obtained
using the bootstrap. Does this indicate a problem with the bootstrap? In
fact, it suggests the opposite. Recall that the standard formulas given in
Equation 3.8 on page 66 rely on certain assumptions. For example, they
depend on the unknown parameter $\sigma^2$, the noise variance. We then estimate
$\sigma^2$ using the RSS. Now although the formulas for the standard errors do not
rely on the linear model being correct, the estimate for $\sigma^2$ does. We see in
Figure 3.8 on page 91 that there is a non-linear relationship in the data, and
so the residuals from a linear fit will be inflated, and so will $\hat{\sigma}^2$. Secondly,
the standard formulas assume (somewhat unrealistically) that the $x_i$ are
fixed, and all the variability comes from the variation in the errors $\epsilon_i$. The
bootstrap approach does not rely on any of these assumptions, and so it is
likely giving a more accurate estimate of the standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$
than is the summary() function.

Below we compute the bootstrap standard error estimates and the standard
linear regression estimates that result from fitting the quadratic model
to the data. Since this model provides a good fit to the data (Figure 3.8),
there is now a better correspondence between the bootstrap estimates and
the standard estimates of $\mathrm{SE}(\hat{\beta}_0)$, $\mathrm{SE}(\hat{\beta}_1)$ and $\mathrm{SE}(\hat{\beta}_2)$.

> boot.fn=function(data,index)
+ coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
> set.seed(1)
> boot(Auto,boot.fn,1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
    original        bias    std. error
t1*   56.900   6.098e-03        2.0945
t2*   -0.466  -1.777e-04        0.0334
t3*    0.001   1.324e-06        0.0001

> summary(lm(mpg~horsepower+I(horsepower^2),data=Auto))$coef
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      56.9001    1.80043      32 1.7e-109
horsepower       -0.4662    0.03112     -15  2.3e-40
I(horsepower^2)   0.0012    0.00012      10  2.2e-21


5.4 Exercises 

Conceptual 

1. Using basic statistical properties of the variance, as well as single-variable
calculus, derive (5.6). In other words, prove that $\alpha$ given by
(5.6) does indeed minimize $\mathrm{Var}(\alpha X + (1 - \alpha) Y)$.

2. We will now derive the probability that a given observation is part 
of a bootstrap sample. Suppose that we obtain a bootstrap sample 
from a set of n observations. 

(a) What is the probability that the first bootstrap observation is 
not the jth observation from the original sample? Justify your 
answer. 

(b) What is the probability that the second bootstrap observation 
is not the jth observation from the original sample? 

(c) Argue that the probability that the jth observation is not in the
bootstrap sample is $(1 - 1/n)^n$.

(d) When n = 5, what is the probability that the jth observation is 
in the bootstrap sample? 

(e) When n = 100, what is the probability that the jth observation 
is in the bootstrap sample? 

(f) When n = 10,000, what is the probability that the jth observation
is in the bootstrap sample?

(g) Create a plot that displays, for each integer value of n from 1
to 100,000, the probability that the jth observation is in the
bootstrap sample. Comment on what you observe. 

(h) We will now investigate numerically the probability that a bootstrap
sample of size n = 100 contains the jth observation. Here
j = 4. We repeatedly create bootstrap samples, and each time
we record whether or not the fourth observation is contained in
the bootstrap sample.

> store=rep(NA, 10000)
> for(i in 1:10000){
store[i]=sum(sample(1:100, rep=TRUE)==4)>0
}
> mean(store)

Comment on the results obtained. 

3. We now review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

(b) What are the advantages and disadvantages of k-fold cross-
validation relative to:

i. The validation set approach? 

ii. LOOCV? 

4. Suppose that we use some statistical learning method to make a prediction
for the response Y for a particular value of the predictor X.
Carefully describe how we might estimate the standard deviation of 
our prediction. 

Applied 

5. In Chapter 4, we used logistic regression to predict the probability of 
default using income and balance on the Default data set. We will 
now estimate the test error of this logistic regression model using the 
validation set approach. Do not forget to set a random seed before 
beginning your analysis. 

(a) Fit a logistic regression model that uses income and balance to 
predict default. 

(b) Using the validation set approach, estimate the test error of this 
model. In order to do this, you must perform the following steps: 

i. Split the sample set into a training set and a validation set. 

ii. Fit a multiple logistic regression model using only the training
observations.

iii. Obtain a prediction of default status for each individual in 
the validation set by computing the posterior probability of 
default for that individual, and classifying the individual to 
the default category if the posterior probability is greater 
than 0.5. 

iv. Compute the validation set error, which is the fraction of 
the observations in the validation set that are misclassified. 

(c) Repeat the process in (b) three times, using three different splits 
of the observations into a training set and a validation set. Comment
on the results obtained.

(d) Now consider a logistic regression model that predicts the probability
of default using income, balance, and a dummy variable
for student. Estimate the test error for this model using the validation
set approach. Comment on whether or not including a
dummy variable for student leads to a reduction in the test error 
rate. 

6. We continue to consider the use of a logistic regression model to 
predict the probability of default using income and balance on the 
Default data set. In particular, we will now compute estimates for 
the standard errors of the income and balance logistic regression coefficients
in two different ways: (1) using the bootstrap, and (2) using
the standard formula for computing the standard errors in the glm() 
function. Do not forget to set a random seed before beginning your 
analysis. 

(a) Using the summary() and glm() functions, determine the estimated
standard errors for the coefficients associated with income
and balance in a multiple logistic regression model that uses 
both predictors. 

(b) Write a function, boot.fn(), that takes as input the Default data
set as well as an index of the observations, and that outputs 
the coefficient estimates for income and balance in the multiple 
logistic regression model. 

(c) Use the boot() function together with your boot.fn() function to
estimate the standard errors of the logistic regression coefficients 
for income and balance. 

(d) Comment on the estimated standard errors obtained using the 
glm() function and using your bootstrap function. 

7. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be
used in order to compute the LOOCV test error estimate. Alternatively,
one could compute those quantities using just the glm() and
predict.glm() functions, and a for loop. You will now take this approach
in order to compute the LOOCV error for a simple logistic
regression model on the Weekly data set. Recall that in the context
of classification problems, the LOOCV error is given in (5.4).

(a) Fit a logistic regression model that predicts Direction using Lag1
and Lag2.

(b) Fit a logistic regression model that predicts Direction using Lag1
and Lag2 using all but the first observation.

(c) Use the model from (b) to predict the direction of the first observation.
You can do this by predicting that the first observation
will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this observation
correctly classified?

(d) Write a for loop from i = 1 to i = n, where n is the number of 
observations in the data set, that performs each of the following 
steps: 

i. Fit a logistic regression model using all but the ith observation
to predict Direction using Lag1 and Lag2.

ii. Compute the posterior probability of the market moving up
for the ith observation.

iii. Use the posterior probability for the ith observation in order 
to predict whether or not the market moves up. 

iv. Determine whether or not an error was made in predicting 
the direction for the ith observation. If an error was made, 
then indicate this as a 1, and otherwise indicate it as a 0. 

(e) Take the average of the n numbers obtained in (d)iv in order to 
obtain the LOOCV estimate for the test error. Comment on the 
results. 

8. We will now perform cross-validation on a simulated data set. 

(a) Generate a simulated data set as follows: 

> set.seed(1)
> y=rnorm(100)
> x=rnorm(100)
> y=x-2*x^2+rnorm(100)

In this data set, what is n and what is p? Write out the model 
used to generate the data in equation form. 

(b) Create a scatterplot of X against Y. Comment on what you find. 

(c) Set a random seed, and then compute the LOOCV errors that 
result from fitting the following four models using least squares: 
i. $Y = \beta_0 + \beta_1 X + \epsilon$

ii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$

iii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$

iv. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon$.

Note you may find it helpful to use the data.frame() function
to create a single data set containing both X and Y.

(d) Repeat (c) using another random seed, and report your results. 
Are your results the same as what you got in (c)? Why? 

(e) Which of the models in (c) had the smallest LOOCV error? Is 
this what you expected? Explain your answer. 

(f) Comment on the statistical significance of the coefficient
estimates that result from fitting each of the models in (c) using
least squares. Do these results agree with the conclusions drawn
based on the cross-validation results?

9. We will now consider the Boston housing data set, from the MASS 
library. 

(a) Based on this data set, provide an estimate for the population
mean of medv. Call this estimate $\hat{\mu}$.

(b) Provide an estimate of the standard error of $\hat{\mu}$. Interpret this
result.

Hint: We can compute the standard error of the sample mean by 
dividing the sample standard deviation by the square root of the 
number of observations. 

(c) Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How
does this compare to your answer from (b)?

(d) Based on your bootstrap estimate from (c), provide a 95% confidence
interval for the mean of medv. Compare it to the results
obtained using t.test(Boston$medv).

Hint: You can approximate a 95% confidence interval using the
formula $[\hat{\mu} - 2\,\mathrm{SE}(\hat{\mu}),\ \hat{\mu} + 2\,\mathrm{SE}(\hat{\mu})]$.

(e) Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median
value of medv in the population.

(f) We now would like to estimate the standard error of $\hat{\mu}_{med}$.
Unfortunately, there is no simple formula for computing the standard
error of the median. Instead, estimate the standard error of the
median using the bootstrap. Comment on your findings.

(g) Based on this data set, provide an estimate for the tenth
percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You
can use the quantile() function.)

(h) Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$.
Comment on your findings.


6 Linear Model Selection and Regularization


In the regression setting, the standard linear model 


$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{6.1}$$

is commonly used to describe the relationship between a response Y and
a set of variables $X_1, X_2, \ldots, X_p$. We have seen in Chapter 3 that one
typically fits this model using least squares.

In the chapters that follow, we consider some approaches for extending
the linear model framework. In Chapter 7 we generalize (6.1) in order to
accommodate non-linear, but still additive, relationships, while in
Chapter 8 we consider even more general non-linear models. However, the linear
model has distinct advantages in terms of inference and, on real-world
problems, is often surprisingly competitive in relation to non-linear methods.
Hence, before moving to the non-linear world, we discuss in this chapter
some ways in which the simple linear model can be improved, by replacing
plain least squares fitting with some alternative fitting procedures.

Why might we want to use another fitting procedure instead of least
squares? As we will see, alternative fitting procedures can yield better
prediction accuracy and model interpretability.

• Prediction Accuracy: Provided that the true relationship between the
response and the predictors is approximately linear, the least squares
estimates will have low bias. If $n \gg p$ (that is, if n, the number of
observations, is much larger than p, the number of variables), then the
least squares estimates tend to also have low variance, and hence will
perform well on test observations. However, if n is not much larger
than p, then there can be a lot of variability in the least squares fit,
resulting in overfitting and consequently poor predictions on future
observations not used in model training. And if p > n, then there
is no longer a unique least squares coefficient estimate: the variance
is infinite so the method cannot be used at all. By constraining or
shrinking the estimated coefficients, we can often substantially reduce
the variance at the cost of a negligible increase in bias. This can
lead to substantial improvements in the accuracy with which we can
predict the response for observations not used in model training.

• Model Interpretability: It is often the case that some or many of the
variables used in a multiple regression model are in fact not
associated with the response. Including such irrelevant variables leads to
unnecessary complexity in the resulting model. By removing these
variables (that is, by setting the corresponding coefficient estimates
to zero) we can obtain a model that is more easily interpreted. Now
least squares is extremely unlikely to yield any coefficient estimates
that are exactly zero. In this chapter, we see some approaches for
automatically performing feature selection or variable selection, that is,
for excluding irrelevant variables from a multiple regression model.

There are many alternatives, both classical and modern, to using least 
squares to fit (6.1). In this chapter, we discuss three important classes of 
methods. 

• Subset Selection. This approach involves identifying a subset of the p 
predictors that we believe to be related to the response. We then fit 
a model using least squares on the reduced set of variables. 

• Shrinkage. This approach involves fitting a model involving all p
predictors. However, the estimated coefficients are shrunken towards zero
relative to the least squares estimates. This shrinkage (also known as
regularization) has the effect of reducing variance. Depending on what
type of shrinkage is performed, some of the coefficients may be
estimated to be exactly zero. Hence, shrinkage methods can also perform
variable selection.

• Dimension Reduction. This approach involves projecting the p
predictors into an M-dimensional subspace, where M < p. This is achieved
by computing M different linear combinations, or projections, of the
variables. Then these M projections are used as predictors to fit a
linear regression model by least squares.

In the following sections we describe each of these approaches in greater
detail, along with their advantages and disadvantages. Although this chapter
describes extensions and modifications to the linear model for regression
seen in Chapter 3, the same concepts apply to other methods, such as the
classification models seen in Chapter 4.

6.1 Subset Selection

In this section we consider some methods for selecting subsets of predictors. 
These include best subset and stepwise model selection procedures. 

6.1.1 Best Subset Selection 

To perform best subset selection, we fit a separate least squares regression
for each possible combination of the p predictors. That is, we fit all p models
that contain exactly one predictor, all $\binom{p}{2} = p(p-1)/2$ models that contain
exactly two predictors, and so forth. We then look at all of the resulting
models, with the goal of identifying the one that is best.

The problem of selecting the best model from among the $2^p$ possibilities
considered by best subset selection is not trivial. This is usually broken up
into two stages, as described in Algorithm 6.1.


Algorithm 6.1 Best subset selection 

1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors. This
model simply predicts the sample mean for each observation.

2. For $k = 1, 2, \ldots, p$:

(a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.

(b) Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_k$. Here best
is defined as having the smallest RSS, or equivalently largest $R^2$.

3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using
cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.


In Algorithm 6.1, Step 2 identifies the best model (on the training data)
for each subset size, in order to reduce the problem from one of $2^p$ possible
models to one of p + 1 possible models. In Figure 6.1, these models form
the lower frontier depicted in red.
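
Steps 1 and 2 of Algorithm 6.1 can be sketched in a few lines of base R. This is only an illustrative brute-force version, not the implementation used for the figures: the simulated data, the coefficient values, and the helper name rss_of are assumptions made for the example.

```r
# Brute-force sketch of Algorithm 6.1, Steps 1-2 (illustrative only).
# Assumption: simulated data in which only x1 and x3 affect y.
set.seed(1)
n <- 100
p <- 4
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
dat <- cbind(X, y = 2 * X$x1 - 3 * X$x3 + rnorm(n))

# Hypothetical helper: training RSS of the least squares fit on `vars`
rss_of <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  fit <- lm(as.formula(paste("y ~", rhs)), data = dat)
  sum(resid(fit)^2)
}

preds <- names(X)
best <- vector("list", p + 1)          # best[[k + 1]] stores M_k
best[[1]] <- list(vars = character(0), rss = rss_of(character(0)))  # null model M_0
for (k in 1:p) {
  subsets <- combn(preds, k, simplify = FALSE)  # all choose(p, k) subsets of size k
  rss <- sapply(subsets, rss_of)
  best[[k + 1]] <- list(vars = subsets[[which.min(rss)]], rss = min(rss))
}

# Training RSS of M_0, ..., M_p: it can only decrease as k grows
best_rss <- sapply(best, function(m) m$rss)
```

The loop fits every one of the $2^p$ models, which is why this approach is only practical for small p. Step 3, choosing among the p + 1 stored models, is deliberately left out because it requires an estimate of test error rather than training RSS.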

Now in order to select a single best model, we must simply choose among
these p + 1 options. This task must be performed with care, because the
RSS of these p + 1 models decreases monotonically, and the $R^2$ increases
monotonically, as the number of features included in the models increases.
Therefore, if we use these statistics to select the best model, then we will
always end up with a model involving all of the variables. The problem is
that a low RSS or a high $R^2$ indicates a model with a low training error,
whereas we wish to choose a model that has a low test error. (As shown
in Chapter 2 in Figures 2.9-2.11, training error tends to be quite a bit
smaller than test error, and a low training error by no means guarantees
a low test error.) Therefore, in Step 3, we use cross-validated prediction
error, $C_p$, BIC, or adjusted $R^2$ in order to select among
$\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_p$.
These approaches are discussed in Section 6.1.3.

FIGURE 6.1. For each possible model containing a subset of the ten predictors
in the Credit data set, the RSS and $R^2$ are displayed. The red frontier tracks the
best model for a given number of predictors, according to RSS and $R^2$. Though
the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one
of the variables is categorical and takes on three values, leading to the creation of
two dummy variables.

An application of best subset selection is shown in Figure 6.1. Each
plotted point corresponds to a least squares regression model fit using a
different subset of the ten predictors in the Credit data set, discussed in
Chapter 3. Here the variable ethnicity is a three-level qualitative variable,
and so is represented by two dummy variables, which are selected separately
in this case, giving 11 candidate variables. We have plotted the RSS and $R^2$
statistics for each model, as a function of the number of variables. The red
curves connect the best models for each model size, according to RSS or $R^2$.
The figure shows that, as expected, these quantities improve as the number of
variables increases; however, from the three-variable model on, there is
little improvement in RSS and $R^2$ as a result of including additional predictors.

Although we have presented best subset selection here for least squares 
regression, the same ideas apply to other types of models, such as logistic 
regression. In the case of logistic regression, instead of ordering models by 
RSS in Step 2 of Algorithm 6.1, we instead use the deviance, a measure 
that plays the role of RSS for a broader class of models. The deviance is 
negative two times the maximized log-likelihood; the smaller the deviance, 
the better the fit. 

While best subset selection is a simple and conceptually appealing
approach, it suffers from computational limitations. The number of possible
models that must be considered grows rapidly as p increases. In general,
there are $2^p$ models that involve subsets of p predictors. So if p = 10,
then there are approximately 1,000 possible models to be considered, and if
p = 20, then there are over one million possibilities! Consequently, best
subset selection becomes computationally infeasible for values of p greater than
around 40, even with extremely fast modern computers. There are
computational shortcuts, so-called branch-and-bound techniques, for
eliminating some choices, but these have their limitations as p gets large. They also
only work for least squares linear regression. We present computationally
efficient alternatives to best subset selection next.

6.1.2 Stepwise Selection 

For computational reasons, best subset selection cannot be applied with 
very large p. Best subset selection may also suffer from statistical problems 
when p is large. The larger the search space, the higher the chance of finding 
models that look good on the training data, even though they might not 
have any predictive power on future data. Thus an enormous search space 
can lead to overfitting and high variance of the coefficient estimates. 

For both of these reasons, stepwise methods, which explore a far more 
restricted set of models, are attractive alternatives to best subset selection. 


Forward Stepwise Selection 

Forward stepwise selection is a computationally efficient alternative to best
subset selection. While the best subset selection procedure considers all
$2^p$ possible models containing subsets of the p predictors, forward
stepwise considers a much smaller set of models. Forward stepwise selection
begins with a model containing no predictors, and then adds predictors
to the model, one-at-a-time, until all of the predictors are in the model.
In particular, at each step the variable that gives the greatest additional
improvement to the fit is added to the model. More formally, the forward
stepwise selection procedure is given in Algorithm 6.2.


Algorithm 6.2 Forward stepwise selection 

1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.

2. For $k = 0, \ldots, p - 1$:

(a) Consider all p − k models that augment the predictors in $\mathcal{M}_k$
with one additional predictor.

(b) Choose the best among these p − k models, and call it $\mathcal{M}_{k+1}$.
Here best is defined as having smallest RSS or highest $R^2$.

3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using
cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.

Unlike best subset selection, which involved fitting $2^p$ models, forward
stepwise selection involves fitting one null model, along with p − k models
in the kth iteration, for $k = 0, \ldots, p - 1$. This amounts to a total of
$1 + \sum_{k=0}^{p-1} (p - k) = 1 + p(p+1)/2$ models. This is a substantial difference: when
p = 20, best subset selection requires fitting 1,048,576 models, whereas
forward stepwise selection requires fitting only 211 models.¹
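
The two counts quoted for p = 20 can be checked directly. The short R snippet below simply evaluates both formulas; the variable names are chosen for this illustration.

```r
# Number of least squares fits required when p = 20
p <- 20
n_best_subset <- 2^p                  # every subset of the p predictors
n_forward <- 1 + sum(p - 0:(p - 1))   # null model plus (p - k) fits at step k
c(best_subset = n_best_subset, forward = n_forward)
```

The forward stepwise count agrees with the closed form 1 + p(p+1)/2, which grows quadratically in p rather than exponentially.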

In Step 2(b) of Algorithm 6.2, we must identify the best model from
among those p − k that augment $\mathcal{M}_k$ with one additional predictor. We can
do this by simply choosing the model with the lowest RSS or the highest
$R^2$. However, in Step 3, we must identify the best model among a set of
models with different numbers of variables. This is more challenging, and
is discussed in Section 6.1.3.
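
The greedy search of Algorithm 6.2 (Steps 1 and 2) might be sketched as follows in base R. The simulated data and the helper rss_of are assumptions for illustration, and Step 3 is omitted because it needs an estimate of test error.

```r
# Sketch of forward stepwise selection (Algorithm 6.2, Steps 1-2).
# Assumption: simulated data in which only x1 and x3 affect y.
set.seed(1)
n <- 100
p <- 5
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
dat <- cbind(X, y = 2 * X$x1 - 3 * X$x3 + rnorm(n))

# Hypothetical helper: training RSS of the least squares fit on `vars`
rss_of <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  sum(resid(lm(as.formula(paste("y ~", rhs)), data = dat))^2)
}

selected <- character(0)
remaining <- names(X)
path <- list(list(vars = selected, rss = rss_of(selected)))   # M_0
for (k in 0:(p - 1)) {
  # Fit the p - k models that add one remaining predictor to M_k
  cand_rss <- sapply(remaining, function(v) rss_of(c(selected, v)))
  pick <- names(which.min(cand_rss))          # largest drop in RSS
  selected <- c(selected, pick)
  remaining <- setdiff(remaining, pick)
  path[[k + 2]] <- list(vars = selected, rss = min(cand_rss)) # M_{k+1}
}
```

Because each model is built from the previous one, the models in path are nested. That nesting is what makes the search cheap, and it is also why the procedure can miss the best model of a given size.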

Forward stepwise selection's computational advantage over best subset
selection is clear. Though forward stepwise tends to do well in practice,
it is not guaranteed to find the best possible model out of all $2^p$
models containing subsets of the p predictors. For instance, suppose that in a
given data set with p = 3 predictors, the best possible one-variable model
contains $X_1$, and the best possible two-variable model instead contains $X_2$
and $X_3$. Then forward stepwise selection will fail to select the best possible
two-variable model, because $\mathcal{M}_1$ will contain $X_1$, so $\mathcal{M}_2$ must also contain
$X_1$ together with one additional variable.

Table 6.1, which shows the first four selected models for best subset
and forward stepwise selection on the Credit data set, illustrates this
phenomenon. Both best subset selection and forward stepwise selection choose
rating for the best one-variable model and then include income and student
for the two- and three-variable models. However, best subset selection
replaces rating by cards in the four-variable model, while forward stepwise
selection must maintain rating in its four-variable model. In this example,
Figure 6.1 indicates that there is not much difference between the three-
and four-variable models in terms of RSS, so either of the four-variable
models will likely be adequate.

Forward stepwise selection can be applied even in the high-dimensional
setting where n < p; however, in this case, it is possible to construct
submodels $\mathcal{M}_0, \ldots, \mathcal{M}_{n-1}$ only, since each submodel is fit using least squares,
which will not yield a unique solution if $p \geq n$.

¹ Though forward stepwise selection considers p(p + 1)/2 + 1 models, it performs a
guided search over model space, and so the effective model space considered contains
substantially more than p(p + 1)/2 + 1 models.

# Variables   Best subset                      Forward stepwise
One           rating                           rating
Two           rating, income                   rating, income
Three         rating, income, student          rating, income, student
Four          cards, income, student, limit    rating, income, student, limit

TABLE 6.1. The first four selected models for best subset selection and forward
stepwise selection on the Credit data set. The first three models are identical but
the fourth models differ.

Backward Stepwise Selection

Like forward stepwise selection, backward stepwise selection provides an
efficient alternative to best subset selection. However, unlike forward
stepwise selection, it begins with the full least squares model containing
all p predictors, and then iteratively removes the least useful predictor,
one-at-a-time. Details are given in Algorithm 6.3.


Algorithm 6.3 Backward stepwise selection 

1. Let $\mathcal{M}_p$ denote the full model, which contains all p predictors.

2. For $k = p, p - 1, \ldots, 1$:

(a) Consider all k models that contain all but one of the predictors
in $\mathcal{M}_k$, for a total of k − 1 predictors.

(b) Choose the best among these k models, and call it $\mathcal{M}_{k-1}$. Here
best is defined as having smallest RSS or highest $R^2$.

3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using
cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.


Like forward stepwise selection, the backward selection approach searches
through only 1 + p(p + 1)/2 models, and so can be applied in settings where
p is too large to apply best subset selection.² Also like forward stepwise
selection, backward stepwise selection is not guaranteed to yield the best
model containing a subset of the p predictors.

Backward selection requires that the number of samples n is larger than 
the number of variables p (so that the full model can be fit). In contrast, 
forward stepwise can be used even when n < p, and so is the only viable 
subset method when p is very large. 
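
Backward stepwise selection (Algorithm 6.3, Steps 1 and 2) can be sketched in the same way, starting from the full model and dropping one predictor per step. As before, the simulated data and the helper rss_of are assumptions for illustration. Note that n > p here, so the full model can be fit.

```r
# Sketch of backward stepwise selection (Algorithm 6.3, Steps 1-2).
# Assumption: simulated data in which only x1 and x3 affect y.
set.seed(2)
n <- 100
p <- 5
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
dat <- cbind(X, y = 2 * X$x1 - 3 * X$x3 + rnorm(n))

# Hypothetical helper: training RSS of the least squares fit on `vars`
rss_of <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  sum(resid(lm(as.formula(paste("y ~", rhs)), data = dat))^2)
}

current <- names(X)                                          # M_p: full model
path <- list(list(vars = current, rss = rss_of(current)))
for (k in p:1) {
  # RSS of each of the k models obtained by dropping one predictor from M_k
  cand_rss <- sapply(current, function(v) rss_of(setdiff(current, v)))
  drop_var <- names(which.min(cand_rss))   # least useful predictor
  current <- setdiff(current, drop_var)
  path[[length(path) + 1]] <- list(vars = current, rss = min(cand_rss))  # M_{k-1}
}

# Training RSS along the path M_p, M_{p-1}, ..., M_0
path_rss <- sapply(path, function(m) m$rss)
```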


² Like forward stepwise selection, backward stepwise selection performs a guided
search over model space, and so effectively considers substantially more than
1 + p(p + 1)/2 models.

Hybrid Approaches

The best subset, forward stepwise, and backward stepwise selection
approaches generally give similar but not identical models. As another
alternative, hybrid versions of forward and backward stepwise selection are
available, in which variables are added to the model sequentially, in analogy
to forward selection. However, after adding each new variable, the method
may also remove any variables that no longer provide an improvement in
the model fit. Such an approach attempts to more closely mimic best
subset selection while retaining the computational advantages of forward and
backward stepwise selection.

6.1.3 Choosing the Optimal Model 

Best subset selection, forward selection, and backward selection result in
the creation of a set of models, each of which contains a subset of the p
predictors. In order to implement these methods, we need a way to determine
which of these models is best. As we discussed in Section 6.1.1, the model
containing all of the predictors will always have the smallest RSS and the
largest $R^2$, since these quantities are related to the training error. Instead,
we wish to choose a model with a low test error. As is evident here, and as
we show in Chapter 2, the training error can be a poor estimate of the test
error. Therefore, RSS and $R^2$ are not suitable for selecting the best model
among a collection of models with different numbers of predictors.

In order to select the best model with respect to test error, we need to 
estimate this test error. There are two common approaches: 

1. We can indirectly estimate test error by making an adjustment to the 
training error to account for the bias due to overfitting. 

2. We can directly estimate the test error, using either a validation set 
approach or a cross-validation approach, as discussed in Chapter 5. 

We consider both of these approaches below. 

$C_p$, AIC, BIC, and Adjusted $R^2$

We show in Chapter 2 that the training set MSE is generally an
underestimate of the test MSE. (Recall that MSE = RSS/n.) This is because
when we fit a model to the training data using least squares, we
specifically estimate the regression coefficients such that the training RSS (but
not the test RSS) is as small as possible. In particular, the training error
will decrease as more variables are included in the model, but the test error
may not. Therefore, training set RSS and training set $R^2$ cannot be used
to select from among a set of models with different numbers of variables.

However, a number of techniques for adjusting the training error for the
model size are available. These approaches can be used to select among a set
of models with different numbers of variables. We now consider four such
approaches: $C_p$, Akaike information criterion (AIC), Bayesian information
criterion (BIC), and adjusted $R^2$. Figure 6.2 displays $C_p$, BIC, and adjusted
$R^2$ for the best model of each size produced by best subset selection on the
Credit data set.

FIGURE 6.2. $C_p$, BIC, and adjusted $R^2$ are shown for the best models of each
size for the Credit data set (the lower frontier in Figure 6.1). $C_p$ and BIC are
estimates of test MSE. In the middle plot we see that the BIC estimate of test
error shows an increase after four variables are selected. The other two plots are
rather flat after four variables are included.

For a fitted least squares model containing d predictors, the $C_p$ estimate
of test MSE is computed using the equation

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right), \tag{6.2}$$




where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each
response measurement in (6.1).³ Essentially, the $C_p$ statistic adds a penalty
of $2d\hat{\sigma}^2$ to the training RSS in order to adjust for the fact that the training
error tends to underestimate the test error. Clearly, the penalty increases as
the number of predictors in the model increases; this is intended to adjust
for the corresponding decrease in training RSS. Though it is beyond the
scope of this book, one can show that if $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$ in
(6.2), then $C_p$ is an unbiased estimate of test MSE. As a consequence, the
$C_p$ statistic tends to take on a small value for models with a low test error,
so when determining which of a set of models is best, we choose the model
with the lowest $C_p$ value. In Figure 6.2, $C_p$ selects the six-variable model
containing the predictors income, limit, rating, cards, age and student.


³ Mallow's $C_p$ is sometimes defined as $C'_p = \mathrm{RSS}/\hat{\sigma}^2 + 2d - n$. This is equivalent to
the definition given above in the sense that $C_p = \frac{1}{n}\hat{\sigma}^2(C'_p + n)$, and so the model with
smallest $C_p$ also has smallest $C'_p$.

The AIC criterion is defined for a large class of models fit by maximum
likelihood. In the case of the model (6.1) with Gaussian errors, maximum
likelihood and least squares are the same thing. In this case AIC is given by

$$\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right),$$

where, for simplicity, we have omitted an additive constant. Hence for least
squares models, $C_p$ and AIC are proportional to each other, and so only
$C_p$ is displayed in Figure 6.2.

BIC is derived from a Bayesian point of view, but ends up looking similar
to $C_p$ (and AIC) as well. For the least squares model with d predictors, the
BIC is, up to irrelevant constants, given by

$$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right). \tag{6.3}$$

Like $C_p$, the BIC will tend to take on a small value for a model with a
low test error, and so generally we select the model that has the lowest
BIC value. Notice that BIC replaces the $2d\hat{\sigma}^2$ used by $C_p$ with a $\log(n)\,d\hat{\sigma}^2$
term, where n is the number of observations. Since $\log n > 2$ for any n > 7,
the BIC statistic generally places a heavier penalty on models with many
variables, and hence results in the selection of smaller models than $C_p$.
In Figure 6.2, we see that this is indeed the case for the Credit data set;
BIC chooses a model that contains only the four predictors income, limit,
cards, and student. In this case the curves are very flat and so there does
not appear to be much difference in accuracy between the four-variable and
six-variable models.

The adjusted $R^2$ statistic is another popular approach for selecting among
a set of models that contain different numbers of variables. Recall from
Chapter 3 that the usual $R^2$ is defined as 1 − RSS/TSS, where
$\mathrm{TSS} = \sum (y_i - \bar{y})^2$ is the total sum of squares for the response. Since RSS always
decreases as more variables are added to the model, the $R^2$ always increases
as more variables are added. For a least squares model with d variables,
the adjusted $R^2$ statistic is calculated as

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}. \tag{6.4}$$

Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with
a low test error, a large value of adjusted $R^2$ indicates a model with a
small test error. Maximizing the adjusted $R^2$ is equivalent to minimizing
$\mathrm{RSS}/(n - d - 1)$. While RSS always decreases as the number of variables in the model
increases, $\mathrm{RSS}/(n - d - 1)$ may increase or decrease, due to the presence of d in the
denominator.

The intuition behind the adjusted $R^2$ is that once all of the correct
variables have been included in the model, adding additional noise variables
will lead to only a very small decrease in RSS. Since adding noise variables
leads to an increase in d, such variables will lead to an increase in $\mathrm{RSS}/(n - d - 1)$,
and consequently a decrease in the adjusted $R^2$. Therefore, in theory, the
model with the largest adjusted $R^2$ will have only correct variables and
no noise variables. Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays
a price for the inclusion of unnecessary variables in the model. Figure 6.2
displays the adjusted $R^2$ for the Credit data set. Using this statistic results
in the selection of a model that contains seven variables, adding gender to
the model selected by $C_p$ and AIC.

$C_p$, AIC, and BIC all have rigorous theoretical justifications that are
beyond the scope of this book. These justifications rely on asymptotic
arguments (scenarios where the sample size n is very large). Despite its
popularity, and even though it is quite intuitive, the adjusted $R^2$ is not as well
motivated in statistical theory as AIC, BIC, and $C_p$. All of these measures
are simple to use and compute. Here we have presented the formulas for
AIC, BIC, and $C_p$ in the case of a linear model fit using least squares;
however, these quantities can also be defined for more general types of
models.
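
As a rough illustration of how simple these measures are to compute, the R sketch below evaluates (6.2), the AIC formula, (6.3), and (6.4) for a nested sequence of least squares models. The simulated data, the function name criteria, and the choice to estimate $\hat{\sigma}^2$ from the residuals of the largest model are all assumptions made for this example.

```r
# Formulas (6.2), AIC, (6.3) and (6.4) for a least squares model with d predictors
criteria <- function(rss, tss, n, d, sigma2_hat) {
  c(Cp    = (rss + 2 * d * sigma2_hat) / n,
    AIC   = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat),
    BIC   = (rss + log(n) * d * sigma2_hat) / n,
    adjR2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1)))
}

# Assumption: simulated data in which only x1 and x2 affect y
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
tss <- sum((dat$y - mean(dat$y))^2)

# Assumption: sigma^2 is estimated from the full four-variable model
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
sigma2_hat <- sum(resid(full)^2) / (n - 4 - 1)

forms <- c("y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3", "y ~ x1 + x2 + x3 + x4")
tab <- t(sapply(seq_along(forms), function(d) {
  rss <- sum(resid(lm(as.formula(forms[d]), data = dat))^2)
  criteria(rss, tss, n, d, sigma2_hat)
}))
rownames(tab) <- paste0("d=", 1:4)
tab   # one row per model size; small Cp/AIC/BIC and large adjR2 are preferred
```

In the resulting table the AIC column is the $C_p$ column divided by $\hat{\sigma}^2$, so the two always rank models identically, while the BIC column carries the heavier $\log(n)$ penalty.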

Validation and Cross-Validation 

As an alternative to the approaches just discussed, we can directly
estimate the test error using the validation set and cross-validation methods
discussed in Chapter 5. We can compute the validation set error or the
cross-validation error for each model under consideration, and then select
the model for which the resulting estimated test error is smallest. This
procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that
it provides a direct estimate of the test error, and makes fewer assumptions
about the true underlying model. It can also be used in a wider range of
model selection tasks, even in cases where it is hard to pinpoint the model
degrees of freedom (e.g. the number of predictors in the model) or hard to
estimate the error variance $\sigma^2$.

In the past, performing cross-validation was computationally prohibitive
for many problems with large p and/or large n, and so AIC, BIC, $C_p$,
and adjusted $R^2$ were more attractive approaches for choosing among a
set of models. However, nowadays with fast computers, the computations
required to perform cross-validation are hardly ever an issue. Thus,
cross-validation is a very attractive approach for selecting from among a number
of models under consideration.

Figure 6.3 displays, as a function of d, the BIC, validation set errors, and
cross-validation errors on the Credit data, for the best d-variable model.
The validation errors were calculated by randomly selecting three-quarters
of the observations as the training set, and the remainder as the
validation set. The cross-validation errors were computed using k = 10 folds.
In this case, the validation and cross-validation methods both result in a
six-variable model. However, all three approaches suggest that the four-,
five-, and six-variable models are roughly equivalent in terms of their test
errors.

FIGURE 6.3. For the Credit data set, three quantities are displayed for the
best model containing d predictors, for d ranging from 1 to 11. The overall best
model, based on each of these quantities, is shown as a blue cross. Left: Square
root of BIC. Center: Validation set errors. Right: Cross-validation errors.

In fact, the estimated test error curves displayed in the center and
right-hand panels of Figure 6.3 are quite flat. While a three-variable model clearly
has lower estimated test error than a two-variable model, the estimated test
errors of the 3- to 11-variable models are quite similar. Furthermore, if we
repeated the validation set approach using a different split of the data into
a training set and a validation set, or if we repeated cross-validation using
a different set of cross-validation folds, then the precise model with the
lowest estimated test error would surely change. In this setting, we can
select a model using the one-standard-error rule. We first calculate the
standard error of the estimated test MSE for each model size, and then
select the smallest model for which the estimated test error is within one
standard error of the lowest point on the curve. The rationale here is that
if a set of models appear to be more or less equally good, then we might
as well choose the simplest model, that is, the model with the smallest
number of predictors. In this case, applying the one-standard-error rule
to the validation set or cross-validation approach leads to selection of the
three-variable model.
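
The one-standard-error rule can be sketched in R as follows. Several choices here are assumptions for illustration: the data are simulated, the candidate models form a fixed nested sequence rather than being re-selected within each fold, and the standard error of each cross-validation estimate is taken to be the standard deviation of the K per-fold MSEs divided by $\sqrt{K}$.

```r
# Sketch: 10-fold cross-validation followed by the one-standard-error rule.
# Assumption: simulated data in which only x1 and x2 affect y.
set.seed(1)
n <- 200
K <- 10
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Assumption: a fixed nested sequence of candidate models of size 1 to 5
forms <- c("y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3",
           "y ~ x1 + x2 + x3 + x4", "y ~ x1 + x2 + x3 + x4 + x5")
fold <- sample(rep(1:K, length.out = n))   # random fold label per observation

# fold_mse[k, d]: MSE on held-out fold k for the d-variable model
fold_mse <- sapply(forms, function(f) {
  sapply(1:K, function(k) {
    fit <- lm(as.formula(f), data = dat[fold != k, ])
    mean((dat$y[fold == k] - predict(fit, newdata = dat[fold == k, ]))^2)
  })
})

cv_mean <- colMeans(fold_mse)               # CV estimate of test MSE per size
cv_se <- apply(fold_mse, 2, sd) / sqrt(K)   # its standard error
d_min <- unname(which.min(cv_mean))         # size with lowest CV error
# Smallest size whose CV error is within one SE of the minimum
d_1se <- min(which(cv_mean <= cv_mean[d_min] + cv_se[d_min]))
```

By construction d_1se is never larger than d_min, so the rule can only move the choice toward a simpler model.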


6.2 Shrinkage Methods 

The subset selection methods described in Section 6.1 involve using least
squares to fit a linear model that contains a subset of the predictors. As an
alternative, we can fit a model containing all p predictors using a technique
that constrains or regularizes the coefficient estimates, or equivalently, that
shrinks the coefficient estimates towards zero. It may not be immediately
obvious why such a constraint should improve the fit, but it turns out that
shrinking the coefficient estimates can significantly reduce their variance.
The two best-known techniques for shrinking the regression coefficients
towards zero are ridge regression and the lasso.


6.2.1 Ridge Regression 

Recall from Chapter 3 that the least squares fitting procedure estimates 
$\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2.$$

Ridge regression is very similar to least squares, except that the coefficients
are estimated by minimizing a slightly different quantity. In particular, the
ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad (6.5)$$


where $\lambda \ge 0$ is a tuning parameter, to be determined separately.
Equation 6.5 trades off two different criteria. As with least squares, ridge
regression seeks coefficient estimates that fit the data well, by making the RSS
small. However, the second term, $\lambda \sum_j \beta_j^2$, called a shrinkage penalty, is
small when $\beta_1, \ldots, \beta_p$ are close to zero, and so it has the effect of shrinking
the estimates of $\beta_j$ towards zero. The tuning parameter $\lambda$ serves to control
the relative impact of these two terms on the regression coefficient
estimates. When $\lambda = 0$, the penalty term has no effect, and ridge regression
will produce the least squares estimates. However, as $\lambda \to \infty$, the impact of
the shrinkage penalty grows, and the ridge regression coefficient estimates
will approach zero. Unlike least squares, which generates only one set of
coefficient estimates, ridge regression will produce a different set of coefficient
estimates, $\hat{\beta}^R_\lambda$, for each value of $\lambda$. Selecting a good value for $\lambda$ is critical;
we defer this discussion to Section 6.2.3, where we use cross-validation.
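
To see how the estimates move with the tuning parameter, consider a minimal sketch with a single centered predictor, where the minimizer of (6.5) has a simple closed form. The data and function names below are made up for illustration; the closed form follows from setting the derivative of the ridge criterion to zero and holds only in this one-predictor, centered case.

```python
def ridge_coefficient(x, y, lam):
    """Ridge estimate for one centered predictor and a centered response.

    Minimizes sum((y_i - beta * x_i)^2) + lam * beta^2, which gives
    beta = sum(x_i * y_i) / (sum(x_i^2) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def ridge_objective(x, y, beta, lam):
    """RSS plus the shrinkage penalty for a candidate coefficient."""
    rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    return rss + lam * beta ** 2


# Hypothetical data; both x and y already have mean zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.8, 4.1]

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
path = {lam: ridge_coefficient(x, y, lam) for lam in lambdas}
for lam in lambdas:
    print(lam, round(path[lam], 5))
```

At a tuning parameter of zero the result is the least squares slope, and the estimate moves steadily towards zero as the tuning parameter grows.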

Note that in (6.5), the shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$, but
not to the intercept $\beta_0$. We want to shrink the estimated association of
each variable with the response; however, we do not want to shrink the
intercept, which is simply a measure of the mean value of the response
when $x_{i1} = x_{i2} = \ldots = x_{ip} = 0$. If we assume that the variables (that is,
the columns of the data matrix $\mathbf{X}$) have been centered to have mean zero
before ridge regression is performed, then the estimated intercept will take
the form $\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{n} y_i / n$.


FIGURE 6.4. The standardized ridge regression coefficients are displayed for
the Credit data set, as a function of $\lambda$ and $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$.


An Application to the Credit Data 

In Figure 6.4, the ridge regression coefficient estimates for the Credit data
set are displayed. In the left-hand panel, each curve corresponds to the
ridge regression coefficient estimate for one of the ten variables, plotted
as a function of $\lambda$. For example, the black solid line represents the ridge
regression estimate for the income coefficient, as $\lambda$ is varied. At the extreme
left-hand side of the plot, $\lambda$ is essentially zero, and so the corresponding
ridge coefficient estimates are the same as the usual least squares
estimates. But as $\lambda$ increases, the ridge coefficient estimates shrink towards
zero. When $\lambda$ is extremely large, then all of the ridge coefficient estimates
are basically zero; this corresponds to the null model that contains no
predictors. In this plot, the income, limit, rating, and student variables are
displayed in distinct colors, since these variables tend to have by far the
largest coefficient estimates. While the ridge coefficient estimates tend to
decrease in aggregate as $\lambda$ increases, individual coefficients, such as rating
and income, may occasionally increase as $\lambda$ increases.

The right-hand panel of Figure 6.4 displays the same ridge coefficient
estimates as the left-hand panel, but instead of displaying $\lambda$ on the x-axis,
we now display $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$, where $\hat{\beta}$ denotes the vector of least squares
coefficient estimates. The notation $\|\beta\|_2$ denotes the $\ell_2$ norm (pronounced
“ell 2”) of a vector, and is defined as $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$. It measures
the distance of $\beta$ from zero. As $\lambda$ increases, the $\ell_2$ norm of $\hat{\beta}^R_\lambda$ will always
decrease, and so will $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$. The latter quantity ranges from 1 (when
$\lambda = 0$, in which case the ridge regression coefficient estimate is the same
as the least squares estimate, and so their $\ell_2$ norms are the same) to 0
(when $\lambda = \infty$, in which case the ridge regression coefficient estimate is a
vector of zeros, with $\ell_2$ norm equal to zero). Therefore, we can think of the
x-axis in the right-hand panel of Figure 6.4 as the amount that the ridge
regression coefficient estimates have been shrunken towards zero; a small
value indicates that they have been shrunken very close to zero.
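
As a small illustration of the quantity on that x-axis, the sketch below computes the ratio of two norms in Python. The two coefficient vectors are hypothetical and chosen so that the arithmetic is easy to follow; they are not estimates from the Credit data.

```python
import math


def l2_norm(beta):
    """Euclidean (ell-2) norm: square root of the sum of squared entries."""
    return math.sqrt(sum(b * b for b in beta))


# Hypothetical coefficient vectors, for illustration only.
beta_ls = [3.0, -4.0]      # least squares estimates
beta_ridge = [1.5, -2.0]   # ridge estimates at some positive tuning value

# Ratio near 1: little shrinkage. Ratio near 0: heavy shrinkage.
shrinkage_ratio = l2_norm(beta_ridge) / l2_norm(beta_ls)
print(shrinkage_ratio)
```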

The standard least squares coefficient estimates discussed in Chapter 3
are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a
scaling of the least squares coefficient estimates by a factor of $1/c$. In other
words, regardless of how the jth predictor is scaled, $X_j \hat{\beta}_j$ will remain the
same. In contrast, the ridge regression coefficient estimates can change
substantially when multiplying a given predictor by a constant. For instance,
consider the income variable, which is measured in dollars. One could
reasonably have measured income in thousands of dollars, which would result
in a reduction in the observed values of income by a factor of 1,000. Now due
to the sum of squared coefficients term in the ridge regression formulation
(6.5), such a change in scale will not simply cause the ridge regression
coefficient estimate for income to change by a factor of 1,000. In other words,
$X_j \hat{\beta}^R_{j,\lambda}$ will depend not only on the value of $\lambda$, but also on the scaling of the
jth predictor. In fact, the value of $X_j \hat{\beta}^R_{j,\lambda}$ may even depend on the scaling
of the other predictors! Therefore, it is best to apply ridge regression after
standardizing the predictors, using the formula



$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}, \qquad (6.6)$$


so that they are all on the same scale. In (6.6), the denominator is the 
estimated standard deviation of the jth predictor. Consequently, all of the 
standardized predictors will have a standard deviation of one. As a
result the final fit will not depend on the scale on which the predictors are
measured. In Figure 6.4, the y-axis displays the standardized ridge
regression coefficient estimates, that is, the coefficient estimates that result from
performing ridge regression using standardized predictors.
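
The effect of scaling can be checked numerically. The sketch below uses a single predictor with an unpenalized intercept, made-up income and balance values, and an arbitrary tuning value of 1000; all names and numbers are assumptions for illustration. It fits ridge regression to income in dollars and in thousands of dollars, with and without the standardization in (6.6).

```python
import math


def standardize(x):
    """Divide by the estimated standard deviation, as in (6.6)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)
    return [xi / sd for xi in x]


def ridge_fitted(x, y, lam):
    """Fitted values from one-predictor ridge with an unpenalized intercept."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    xc = [xi - xbar for xi in x]
    yc = [yi - ybar for yi in y]
    beta = sum(a * b for a, b in zip(xc, yc)) / (sum(a * a for a in xc) + lam)
    return [ybar + beta * a for a in xc]


# Hypothetical data, not the Credit data set.
income_dollars = [20000.0, 35000.0, 50000.0, 80000.0, 120000.0]
income_thousands = [v / 1000.0 for v in income_dollars]
balance = [300.0, 450.0, 520.0, 800.0, 1100.0]
lam = 1000.0

raw_gap = max(abs(a - b) for a, b in zip(
    ridge_fitted(income_dollars, balance, lam),
    ridge_fitted(income_thousands, balance, lam)))
std_gap = max(abs(a - b) for a, b in zip(
    ridge_fitted(standardize(income_dollars), balance, lam),
    ridge_fitted(standardize(income_thousands), balance, lam)))

print("largest difference in fitted values, raw predictors:", raw_gap)
print("largest difference in fitted values, standardized:", std_gap)
```

On the raw scale the two fits disagree noticeably, while after standardization they agree up to floating-point rounding.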

Why Does Ridge Regression Improve Over Least Squares? 

Ridge regression’s advantage over least squares is rooted in the bias-variance
trade-off. As $\lambda$ increases, the flexibility of the ridge regression fit decreases,
leading to decreased variance but increased bias. This is illustrated in the
left-hand panel of Figure 6.5, using a simulated data set containing p = 45
predictors and n = 50 observations. The green curve in the left-hand panel
of Figure 6.5 displays the variance of the ridge regression predictions as a
function of $\lambda$. At the least squares coefficient estimates, which correspond
to ridge regression with $\lambda = 0$, the variance is high but there is no bias. But
as $\lambda$ increases, the shrinkage of the ridge coefficient estimates leads to a
substantial reduction in the variance of the predictions, at the expense of a
slight increase in bias. Recall that the test mean squared error (MSE),
plotted in purple, is a function of the variance plus the squared bias. For values


FIGURE 6.5. Squared bias (black), variance (green), and test mean squared
error (purple) for the ridge regression predictions on a simulated data set, as a
function of $\lambda$ and $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$. The horizontal dashed lines indicate the minimum
possible MSE. The purple crosses indicate the ridge regression models for which
the MSE is smallest.


of $\lambda$ up to about 10, the variance decreases rapidly, with very little increase
in bias, plotted in black. Consequently, the MSE drops considerably as $\lambda$
increases from 0 to 10. Beyond this point, the decrease in variance due to
increasing $\lambda$ slows, and the shrinkage on the coefficients causes them to be
significantly underestimated, resulting in a large increase in the bias. The
minimum MSE is achieved at approximately $\lambda = 30$. Interestingly, because
of its high variance, the MSE associated with the least squares fit, when
$\lambda = 0$, is almost as high as that of the null model for which all coefficient
estimates are zero, when $\lambda = \infty$. However, for an intermediate value of $\lambda$,
the MSE is considerably lower.

The right-hand panel of Figure 6.5 displays the same curves as the left-hand
panel, this time plotted against the $\ell_2$ norm of the ridge regression
coefficient estimates divided by the $\ell_2$ norm of the least squares estimates.
Now as we move from left to right, the fits become more flexible, and so
the bias decreases and the variance increases.

In general, in situations where the relationship between the response 
and the predictors is close to linear, the least squares estimates will have 
low bias but may have high variance. This means that a small change in 
the training data can cause a large change in the least squares coefficient 
estimates. In particular, when the number of variables p is almost as large 
as the number of observations n, as in the example in Figure 6.5, the 
least squares estimates will be extremely variable. And if p > n, then the 
least squares estimates do not even have a unique solution, whereas ridge 
regression can still perform well by trading off a small increase in bias for a 
large decrease in variance. Hence, ridge regression works best in situations 
where the least squares estimates have high variance. 

Ridge regression also has substantial computational advantages over best
subset selection, which requires searching through $2^p$ models. As we
discussed previously, even for moderate values of p, such a search can
be computationally infeasible. In contrast, for any fixed value of $\lambda$, ridge
regression only fits a single model, and the model-fitting procedure can
be performed quite quickly. In fact, one can show that the computations
required to solve (6.5), simultaneously for all values of $\lambda$, are almost
identical to those for fitting a model using least squares.

6.2.2 The Lasso 

Ridge regression does have one obvious disadvantage. Unlike best subset, 
forward stepwise, and backward stepwise selection, which will generally 
select models that involve just a subset of the variables, ridge regression 
will include all p predictors in the final model. The penalty $\lambda \sum_j \beta_j^2$ in (6.5)
will shrink all of the coefficients towards zero, but it will not set any of them
exactly to zero (unless $\lambda = \infty$). This may not be a problem for prediction
accuracy, but it can create a challenge in model interpretation in settings in 
which the number of variables p is quite large. For example, in the Credit 
data set, it appears that the most important variables are income, limit, 
rating, and student. So we might wish to build a model including just 
these predictors. However, ridge regression will always generate a model 
involving all ten predictors. Increasing the value of $\lambda$ will tend to reduce
the magnitudes of the coefficients, but will not result in exclusion of any of 
the variables. 

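The lasso is a relatively recent alternative to ridge regression that
overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (6.7)$$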

Comparing (6.7) to (6.5), we see that the lasso and ridge regression have
similar formulations. The only difference is that the $\beta_j^2$ term in the ridge
regression penalty (6.5) has been replaced by $|\beta_j|$ in the lasso penalty (6.7).
In statistical parlance, the lasso uses an $\ell_1$ (pronounced “ell 1”) penalty
instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by

$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.$$

As with ridge regression, the lasso shrinks the coefficient estimates
towards zero. However, in the case of the lasso, the $\ell_1$ penalty has the effect
of forcing some of the coefficient estimates to be exactly equal to zero when
the tuning parameter $\lambda$ is sufficiently large. Hence, much like best subset
selection, the lasso performs variable selection. As a result, models generated
from the lasso are generally much easier to interpret than those produced
by ridge regression. We say that the lasso yields sparse models, that is,
models that involve only a subset of the variables. As in ridge regression,
selecting a good value of $\lambda$ for the lasso is critical; we defer this discussion
to Section 6.2.3, where we use cross-validation.
FIGURE 6.6. The standardized lasso coefficients on the Credit data set are
shown as a function of $\lambda$ and $\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1$.


As an example, consider the coefficient plots in Figure 6.6, which are
generated from applying the lasso to the Credit data set. When $\lambda = 0$, then
the lasso simply gives the least squares fit, and when $\lambda$ becomes sufficiently
large, the lasso gives the null model in which all coefficient estimates equal
zero. However, in between these two extremes, the ridge regression and
lasso models are quite different from each other. Moving from left to right
in the right-hand panel of Figure 6.6, we observe that at first the lasso
results in a model that contains only the rating predictor. Then student and
limit enter the model almost simultaneously, shortly followed by income.
Eventually, the remaining variables enter the model. Hence, depending on
the value of $\lambda$, the lasso can produce a model involving any number of
variables. In contrast, ridge regression will always include all of the variables in
the model, although the magnitude of the coefficient estimates will depend
on $\lambda$.


Another Formulation for Ridge Regression and the Lasso 

One can show that the lasso and ridge regression coefficient estimates solve
the problems

$$\underset{\beta}{\text{minimize}} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s \qquad (6.8)$$

and

$$\underset{\beta}{\text{minimize}} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s, \qquad (6.9)$$

respectively. In other words, for every value of $\lambda$, there is some s such that
Equations (6.7) and (6.8) will give the same lasso coefficient estimates.
Similarly, for every value of $\lambda$ there is a corresponding s such that
Equations (6.5) and (6.9) will give the same ridge regression coefficient estimates.
When p = 2, then (6.8) indicates that the lasso coefficient estimates have
the smallest RSS out of all points that lie within the diamond defined by
$|\beta_1| + |\beta_2| \le s$. Similarly, the ridge regression estimates have the smallest
RSS out of all points that lie within the circle defined by $\beta_1^2 + \beta_2^2 \le s$.

We can think of (6.8) as follows. When we perform the lasso we are trying
to find the set of coefficient estimates that lead to the smallest RSS, subject
to the constraint that there is a budget s for how large $\sum_{j=1}^{p} |\beta_j|$ can be.
When s is extremely large, then this budget is not very restrictive, and so
the coefficient estimates can be large. In fact, if s is large enough that the
least squares solution falls within the budget, then (6.8) will simply yield
the least squares solution. In contrast, if s is small, then $\sum_{j=1}^{p} |\beta_j|$ must be
small in order to avoid violating the budget. Similarly, (6.9) indicates that
when we perform ridge regression, we seek a set of coefficient estimates
such that the RSS is as small as possible, subject to the requirement that
$\sum_{j=1}^{p} \beta_j^2$ not exceed the budget s.

The formulations (6.8) and (6.9) reveal a close connection between the 
lasso, ridge regression, and best subset selection. Consider the problem 

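$$\underset{\beta}{\text{minimize}} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} I(\beta_j \ne 0) \le s. \qquad (6.10)$$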

Here $I(\beta_j \ne 0)$ is an indicator variable: it takes on a value of 1 if $\beta_j \ne 0$, and
equals zero otherwise. Then (6.10) amounts to finding a set of coefficient
estimates such that RSS is as small as possible, subject to the constraint that
no more than s coefficients can be nonzero. The problem (6.10) is equivalent
to best subset selection. Unfortunately, solving (6.10) is computationally
infeasible when p is large, since it requires considering all $\binom{p}{s}$ models
containing s predictors. Therefore, we can interpret ridge regression and the
lasso as computationally feasible alternatives to best subset selection that
replace the intractable form of the budget in (6.10) with forms that are
much easier to solve. Of course, the lasso is much more closely related to
best subset selection, since only the lasso performs feature selection for s
sufficiently small in (6.8).
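
To get a feel for why (6.10) becomes infeasible, the short Python sketch below counts candidate models with the standard library function `math.comb`. The helper names and the choice p = 45 (the number of predictors in the simulated example of Figure 6.5) are used only for illustration.

```python
import math


def models_with_s_predictors(p, s):
    """Number of distinct models that contain exactly s of the p predictors."""
    return math.comb(p, s)


def total_subsets(p):
    """Number of models best subset selection must consider overall."""
    return 2 ** p


p = 45
for s in (1, 5, 10, 20):
    print(s, models_with_s_predictors(p, s))
print("all subsets:", total_subsets(p))
```

Summing the counts over every possible s recovers the total number of subsets.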

The Variable Selection Property of the Lasso 

Why is it that the lasso, unlike ridge regression, results in coefficient 
estimates that are exactly equal to zero? The formulations (6.8) and (6.9) 
can be used to shed light on the issue. Figure 6.7 illustrates the situation. 
The least squares solution is marked as $\hat{\beta}$, while the blue diamond and
FIGURE 6.7. Contours of the error and constraint functions for the lasso
(left) and ridge regression (right). The solid blue areas are the constraint
regions, $|\beta_1| + |\beta_2| \le s$ and $\beta_1^2 + \beta_2^2 \le s$, while the red ellipses are the contours of
the RSS.

circle represent the lasso and ridge regression constraints in (6.8) and (6.9),
respectively. If s is sufficiently large, then the constraint regions will
contain $\hat{\beta}$, and so the ridge regression and lasso estimates will be the same as
the least squares estimates. (Such a large value of s corresponds to $\lambda = 0$
in (6.5) and (6.7).) However, in Figure 6.7 the least squares estimates lie
outside of the diamond and the circle, and so the least squares estimates
are not the same as the lasso and ridge regression estimates.

The ellipses that are centered around $\hat{\beta}$ represent regions of constant
RSS. In other words, all of the points on a given ellipse share a common
value of the RSS. As the ellipses expand away from the least squares
coefficient estimates, the RSS increases. Equations (6.8) and (6.9) indicate
that the lasso and ridge regression coefficient estimates are given by the
first point at which an ellipse contacts the constraint region. Since ridge
regression has a circular constraint with no sharp points, this intersection
will not generally occur on an axis, and so the ridge regression coefficient
estimates will be exclusively non-zero. However, the lasso constraint has
corners at each of the axes, and so the ellipse will often intersect the
constraint region at an axis. When this occurs, one of the coefficients will equal
zero. In higher dimensions, many of the coefficient estimates may equal zero
simultaneously. In Figure 6.7, the intersection occurs at $\beta_1 = 0$, and so the
resulting model will only include $\beta_2$.

In Figure 6.7, we considered the simple case of p = 2. When p = 3, 
then the constraint region for ridge regression becomes a sphere, and the 
constraint region for the lasso becomes a polyhedron. When p > 3, the 
FIGURE 6.8. Left: Plots of squared bias (black), variance (green), and test MSE
(purple) for the lasso on a simulated data set. Right: Comparison of squared bias,
variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted
against their $R^2$ on the training data, as a common form of indexing. The crosses
in both plots indicate the lasso model for which the MSE is smallest.


constraint for ridge regression becomes a hypersphere, and the constraint 
for the lasso becomes a polytope. However, the key ideas depicted in
Figure 6.7 still hold. In particular, the lasso leads to feature selection when
p > 2 due to the sharp corners of the polyhedron or polytope. 

Comparing the Lasso and Ridge Regression 

It is clear that the lasso has a major advantage over ridge regression, in 
that it produces simpler and more interpretable models that involve only a 
subset of the predictors. However, which method leads to better prediction 
accuracy? Figure 6.8 displays the variance, squared bias, and test MSE of 
the lasso applied to the same simulated data as in Figure 6.5. Clearly the 
lasso leads to qualitatively similar behavior to ridge regression, in that as $\lambda$
increases, the variance decreases and the bias increases. In the right-hand
panel of Figure 6.8, the dotted lines represent the ridge regression fits.
Here we plot both against their $R^2$ on the training data. This is another
useful way to index models, and can be used to compare models with 
different types of regularization, as is the case here. In this example, the 
lasso and ridge regression result in almost identical biases. However, the 
variance of ridge regression is slightly lower than the variance of the lasso. 
Consequently, the minimum MSE of ridge regression is slightly smaller than 
that of the lasso. 

However, the data in Figure 6.8 were generated in such a way that all 45 
predictors were related to the response—that is, none of the true coefficients 
$\beta_1, \ldots, \beta_{45}$ equaled zero. The lasso implicitly assumes that a number of the
coefficients truly equal zero. Consequently, it is not surprising that ridge 
regression outperforms the lasso in terms of prediction error in this setting. 
Figure 6.9 illustrates a similar situation, except that now the response is a 
FIGURE 6.9. Left: Plots of squared bias (black), variance (green), and test MSE
(purple) for the lasso. The simulated data is similar to that in Figure 6.8, except
that now only two predictors are related to the response. Right: Comparison of
squared bias, variance and test MSE between lasso (solid) and ridge (dashed).
Both are plotted against their $R^2$ on the training data, as a common form of
indexing. The crosses in both plots indicate the lasso model for which the MSE is
smallest.


function of only 2 out of 45 predictors. Now the lasso tends to outperform 
ridge regression in terms of bias, variance, and MSE. 

These two examples illustrate that neither ridge regression nor the lasso 
will universally dominate the other. In general, one might expect the lasso 
to perform better in a setting where a relatively small number of predictors 
have substantial coefficients, and the remaining predictors have coefficients 
that are very small or that equal zero. Ridge regression will perform better 
when the response is a function of many predictors, all with coefficients of 
roughly equal size. However, the number of predictors that are related to the
response is never known a priori for real data sets. A technique such as
cross-validation can be used in order to determine which approach is better 
on a particular data set. 

As with ridge regression, when the least squares estimates have
excessively high variance, the lasso solution can yield a reduction in variance
at the expense of a small increase in bias, and consequently can
generate more accurate predictions. Unlike ridge regression, the lasso performs
variable selection, and hence results in models that are easier to interpret. 

There are very efficient algorithms for fitting both ridge and lasso models; 
in both cases the entire coefficient paths can be computed with about the 
same amount of work as a single least squares fit. We will explore this 
further in the lab at the end of this chapter. 

A Simple Special Case for Ridge Regression and the Lasso 

In order to obtain a better intuition about the behavior of ridge regression
and the lasso, consider a simple special case with n = p, and $\mathbf{X}$ a
diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal elements.
To simplify the problem further, assume also that we are performing
regression without an intercept. With these assumptions, the usual least squares
problem simplifies to finding $\beta_1, \ldots, \beta_p$ that minimize

$$\sum_{j=1}^{p} (y_j - \beta_j)^2. \qquad (6.11)$$

In this case, the least squares solution is given by

$$\hat{\beta}_j = y_j.$$

And in this setting, ridge regression amounts to finding $\beta_1, \ldots, \beta_p$ such that

$$\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (6.12)$$

is minimized, and the lasso amounts to finding the coefficients such that

$$\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (6.13)$$

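is minimized. One can show that in this setting, the ridge regression
estimates take the form

$$\hat{\beta}^R_j = y_j / (1 + \lambda), \qquad (6.14)$$

and the lasso estimates take the form

$$\hat{\beta}^L_j = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2; \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2; \\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases} \qquad (6.15)$$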

Figure 6.10 displays the situation. We can see that ridge regression and
the lasso perform two very different types of shrinkage. In ridge regression,
each least squares coefficient estimate is shrunken by the same proportion.
In contrast, the lasso shrinks each least squares coefficient towards zero by
a constant amount, $\lambda/2$; the least squares coefficients that are less than
$\lambda/2$ in absolute value are shrunken entirely to zero. The type of
shrinkage performed by the lasso in this simple setting (6.15) is known as
soft-thresholding. The fact that some lasso coefficients are shrunken entirely to
zero explains why the lasso performs feature selection.
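
The two shrinkage rules (6.14) and (6.15) are simple enough to write out directly. The following Python sketch applies them to a made-up response vector with a tuning value of 1; the function names and numbers are hypothetical and serve only to show proportional shrinkage next to soft-thresholding.

```python
def ridge_special_case(y, lam):
    """Equation (6.14): every coefficient is scaled by 1 / (1 + lam)."""
    return [yj / (1.0 + lam) for yj in y]


def lasso_special_case(y, lam):
    """Equation (6.15): soft-thresholding at lam / 2."""
    half = lam / 2.0
    estimates = []
    for yj in y:
        if yj > half:
            estimates.append(yj - half)
        elif yj < -half:
            estimates.append(yj + half)
        else:
            estimates.append(0.0)  # small coefficients are set exactly to zero
    return estimates


# In this special case the least squares estimates equal y itself.
y = [3.0, -2.0, 0.4, -0.3]
lam = 1.0

ridge = ridge_special_case(y, lam)
lasso = lasso_special_case(y, lam)
print("least squares:", y)
print("ridge:        ", ridge)
print("lasso:        ", lasso)
```

Here ridge halves every coefficient and leaves none at zero, whereas the lasso moves the two large coefficients by 0.5 towards zero and sets the two small ones to exactly zero.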

In the case of a more general data matrix X, the story is a little more 
complicated than what is depicted in Figure 6.10, but the main ideas still 
hold approximately: ridge regression more or less shrinks every dimension 
of the data by the same proportion, whereas the lasso more or less shrinks 
all coefficients toward zero by a similar amount, and sufficiently small
coefficients are shrunken all the way to zero.


FIGURE 6.10. The ridge regression and lasso coefficient estimates for a simple
setting with n = p and $\mathbf{X}$ a diagonal matrix with 1’s on the diagonal. Left: The
ridge regression coefficient estimates are shrunken proportionally towards zero, 
relative to the least squares estimates. Right: The lasso coefficient estimates are 
soft-thresholded towards zero. 


Bayesian Interpretation for Ridge Regression and the Lasso 

We now show that one can view ridge regression and the lasso through
a Bayesian lens. A Bayesian viewpoint for regression assumes that the
coefficient vector $\beta$ has some prior distribution, say $p(\beta)$, where
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$. The likelihood of the data can be written as $f(Y \mid X, \beta)$,
where $X = (X_1, \ldots, X_p)$. Multiplying the prior distribution by the
likelihood gives us (up to a proportionality constant) the posterior distribution,
which takes the form

$$p(\beta \mid X, Y) \propto f(Y \mid X, \beta)\, p(\beta \mid X) = f(Y \mid X, \beta)\, p(\beta),$$

where the proportionality above follows from Bayes’ theorem, and the
equality above follows from the assumption that X is fixed.

We assume the usual linear model,

$$Y = \beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p + \epsilon,$$

and suppose that the errors are independent and drawn from a normal
distribution. Furthermore, assume that $p(\beta) = \prod_{j=1}^{p} g(\beta_j)$, for some density
function g. It turns out that ridge regression and the lasso follow naturally
from two special cases of g:

• If g is a Gaussian distribution with mean zero and standard deviation
a function of $\lambda$, then it follows that the posterior mode for $\beta$ (that
is, the most likely value for $\beta$, given the data) is given by the ridge
regression solution. (In fact, the ridge regression solution is also the
posterior mean.)
FIGURE 6.11. Left: Ridge regression is the posterior mode for $\beta$ under a
Gaussian prior. Right: The lasso is the posterior mode for $\beta$ under a double-exponential
prior.


• If g is a double-exponential (Laplace) distribution with mean zero
and scale parameter a function of $\lambda$, then it follows that the posterior
mode for $\beta$ is the lasso solution. (However, the lasso solution is not
the posterior mean, and in fact, the posterior mean does not yield a
sparse coefficient vector.)

The Gaussian and double-exponential priors are displayed in Figure 6.11. 
Therefore, from a Bayesian viewpoint, ridge regression and the lasso follow 
directly from assuming the usual linear model with normal errors, together 
with a simple prior distribution for $\beta$. Notice that the lasso prior is steeply
peaked at zero, while the Gaussian is flatter and fatter at zero. Hence, the 
lasso expects a priori that many of the coefficients are (exactly) zero, while 
ridge assumes the coefficients are randomly distributed about zero. 
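
The link between the priors and the two penalties can be made explicit with a short derivation. The sketch below assumes, beyond what is stated above, that the errors have a known variance $\sigma^2$, that the Gaussian prior has variance $\tau^2$, and that the Laplace prior has scale $b$; the symbols $\sigma^2$, $\tau^2$ and $b$ are introduced here only for illustration.

```latex
% Gaussian prior: g(\beta_j) \propto \exp\{-\beta_j^2 / (2\tau^2)\}
\begin{align*}
-\log p(\beta \mid X, Y)
  &= -\log f(Y \mid X, \beta) - \sum_{j=1}^{p} \log g(\beta_j) + \text{const} \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n}
       \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
     + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const} \\
  &= \frac{1}{2\sigma^2}
       \Big( \mathrm{RSS} + \frac{\sigma^2}{\tau^2} \sum_{j=1}^{p} \beta_j^2 \Big)
     + \text{const}.
\end{align*}

% Laplace prior: g(\beta_j) = \frac{1}{2b} \exp\{-|\beta_j| / b\}
\begin{align*}
-\log p(\beta \mid X, Y)
  &= \frac{1}{2\sigma^2}
       \Big( \mathrm{RSS} + \frac{2\sigma^2}{b} \sum_{j=1}^{p} |\beta_j| \Big)
     + \text{const}.
\end{align*}
```

Under these assumptions, maximizing the posterior is the same as minimizing (6.5) with $\lambda = \sigma^2 / \tau^2$ in the Gaussian case, and the same as minimizing (6.7) with $\lambda = 2\sigma^2 / b$ in the Laplace case. A more concentrated prior (small $\tau^2$ or small $b$) therefore corresponds to a larger penalty.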


6.2.3 Selecting the Tuning Parameter 

Just as the subset selection approaches considered in Section 6.1 require
a method to determine which of the models under consideration is best,
implementing ridge regression and the lasso requires a method for selecting
a value for the tuning parameter $\lambda$ in (6.5) and (6.7), or equivalently, the
value of the constraint s in (6.9) and (6.8). Cross-validation provides a
simple way to tackle this problem. We choose a grid of $\lambda$ values, and compute
the cross-validation error for each value of $\lambda$, as described in Chapter 5. We
then select the tuning parameter value for which the cross-validation error
is smallest. Finally, the model is re-fit using all of the available observations
and the selected value of the tuning parameter.
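
The procedure just described can be sketched in plain Python. The example below is a simplified illustration, not the approach used in the lab: it uses one predictor, an unpenalized intercept, folds assigned by index rather than at random, and made-up data. All function names are hypothetical.

```python
def fit_ridge(x, y, lam):
    """One-predictor ridge fit; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return ybar - slope * xbar, slope


def cv_error(x, y, lam, k):
    """k-fold cross-validation MSE; observation i goes to fold i % k."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        b0, b1 = fit_ridge([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b0 - b1 * x[i]) ** 2 for i in held_out)
    return total / n


def select_lambda(x, y, grid, k=5):
    """Return the grid value with the smallest cross-validation error."""
    errors = {lam: cv_error(x, y, lam, k) for lam in grid}
    best = min(grid, key=lambda lam: errors[lam])
    return best, errors


# Hypothetical data and grid.
x = [float(i) for i in range(10)]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.2, 17.1, 18.9]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]

best_lam, errors = select_lambda(x, y, grid)
final_intercept, final_slope = fit_ridge(x, y, best_lam)  # re-fit on all data
print("selected tuning value:", best_lam)
print("re-fit coefficients:", final_intercept, final_slope)
```

The last step mirrors the final sentence above: once a tuning value has been chosen, the model is fit again using every observation.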

Figure 6.12 displays the choice of $\lambda$ that results from performing
leave-one-out cross-validation on the ridge regression fits from the Credit data
set. The dashed vertical lines indicate the selected value of $\lambda$. In this case
the value is relatively small, indicating that the optimal fit only involves a
FIGURE 6.12. Left: Cross-validation errors that result from applying ridge
regression to the Credit data set with various values of $\lambda$. Right: The coefficient
estimates as a function of $\lambda$. The vertical dashed lines indicate the value of $\lambda$
selected by cross-validation.
small amount of shrinkage relative to the least squares solution. In addition,
the dip is not very pronounced, so there is rather a wide range of values 
that would give very similar error. In a case like this we might simply use 
the least squares solution. 

Figure 6.13 provides an illustration of ten-fold cross-validation applied to 
the lasso fits on the sparse simulated data from Figure 6.9. The left-hand 
panel of Figure 6.13 displays the cross-validation error, while the right-hand 
panel displays the coefficient estimates. The vertical dashed lines indicate 
the point at which the cross-validation error is smallest. The two colored 
lines in the right-hand panel of Figure 6.13 represent the two predictors 
that are related to the response, while the grey lines represent the
unrelated predictors; these are often referred to as signal and noise variables,
respectively. Not only has the lasso correctly given much larger
coefficient estimates to the two signal predictors, but also the minimum
cross-validation error corresponds to a set of coefficient estimates for which only
the signal variables are non-zero. Hence cross-validation together with the 
lasso has correctly identified the two signal variables in the model, even 
though this is a challenging setting, with p = 45 variables and only n = 50 
observations. In contrast, the least squares solution (displayed on the far
right of the right-hand panel of Figure 6.13) assigns a large coefficient
estimate to only one of the two signal variables. 


6.3 Dimension Reduction Methods 




The methods that we have discussed so far in this chapter have controlled 
variance in two different ways, either by using a subset of the original
variables, or by shrinking their coefficients toward zero. All of these methods
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FIGURE 6.13. Left: Ten-fold cross-validation MSE for the lasso, applied to 
the sparse simulated data set from Figure 6.9. Right: The corresponding lasso 
coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit 
for which the cross-validation error is smallest. 


are defined using the original predictors, X\, X2, ■ ■., X p . We now explore 
a class of approaches that transform the predictors and then fit a least 
squares model using the transformed variables. We will refer to these tech¬ 
niques as dimension reduction methods. 

Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of our original
$p$ predictors. That is,

$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \tag{6.16}$$

for some constants $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$, $m = 1, \ldots, M$. We can then fit the
linear regression model

$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \quad i = 1, \ldots, n, \tag{6.17}$$

using least squares. Note that in (6.17), the regression coefficients are given
by $\theta_0, \theta_1, \ldots, \theta_M$. If the constants $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$ are chosen wisely, then
such dimension reduction approaches can often outperform least squares
regression. In other words, fitting (6.17) using least squares can lead to
better results than fitting (6.1) using least squares.

The term dimension reduction comes from the fact that this approach
reduces the problem of estimating the $p + 1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$ to the
simpler problem of estimating the $M + 1$ coefficients $\theta_0, \theta_1, \ldots, \theta_M$, where
$M < p$. In other words, the dimension of the problem has been reduced
from $p + 1$ to $M + 1$.

Notice that from (6.16),

$$\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{jm} x_{ij} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_m \phi_{jm} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij},$$

where

$$\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}. \tag{6.18}$$

Hence (6.17) can be thought of as a special case of the original linear
regression model given by (6.1). Dimension reduction serves to constrain
the estimated $\beta_j$ coefficients, since now they must take the form (6.18).

FIGURE 6.14. The population size (pop) and ad spending (ad) for 100 different
cities are shown as purple circles. The green solid line indicates the first principal
component, and the blue dashed line indicates the second principal component.

This constraint on the form of the coefficients has the potential to bias the 
coefficient estimates. However, in situations where p is large relative to n, 
selecting a value of $M \ll p$ can significantly reduce the variance of the fitted
coefficients. If $M = p$, and all the $Z_m$ are linearly independent, then (6.18)
poses no constraints. In this case, no dimension reduction occurs, and so 
fitting (6.17) is equivalent to performing least squares on the original p 
predictors. 
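
The relationship in (6.18), in which each coefficient on an original predictor is the sum over components of the component coefficient times the loading, can be checked numerically. The following is a minimal Python sketch, not part of the original text; the loadings `phi`, the coefficients `theta`, and the observation `x` are made-up values chosen only for illustration, and the intercept is omitted.

```python
# Hypothetical loadings phi[j][m] for p = 3 predictors and M = 2 components.
phi = [[0.5, 0.1],
       [0.3, -0.4],
       [0.2, 0.7]]
theta = [2.0, -1.0]   # hypothetical coefficients on Z_1 and Z_2 (intercept omitted)
x = [1.0, 2.0, 3.0]   # one hypothetical observation of X_1, X_2, X_3


def components(x, phi):
    """z_m = sum_j phi_jm * x_j, as in (6.16)."""
    n_components = len(phi[0])
    return [sum(phi[j][m] * x[j] for j in range(len(x))) for m in range(n_components)]


def implied_beta(theta, phi):
    """beta_j = sum_m theta_m * phi_jm, as in (6.18)."""
    return [sum(theta[m] * phi[j][m] for m in range(len(theta))) for j in range(len(phi))]


z = components(x, phi)
beta = implied_beta(theta, phi)
pred_from_z = sum(t * zm for t, zm in zip(theta, z))
pred_from_x = sum(b * xj for b, xj in zip(beta, x))
print(pred_from_z, pred_from_x)  # the two predictions agree up to rounding
```

Because the two predictions coincide for any `x`, a model fit on the components is always a linear model in the original predictors, only with coefficients restricted to the form (6.18).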

All dimension reduction methods work in two steps. First, the transformed
predictors $Z_1, Z_2, \ldots, Z_M$ are obtained. Second, the model is fit
using these $M$ predictors. However, the choice of $Z_1, Z_2, \ldots, Z_M$, or
equivalently, the selection of the $\phi_{jm}$'s, can be achieved in different ways. In this
chapter, we will consider two approaches for this task: principal components
and partial least squares.

6.3.1 Principal Components Regression 

Principal components analysis (PCA) is a popular approach for deriving
a low-dimensional set of features from a large set of variables. PCA is
discussed in greater detail as a tool for unsupervised learning in Chapter 10.
Here we describe its use as a dimension reduction technique for regression.

An Overview of Principal Components Analysis

PCA is a technique for reducing the dimension of an $n \times p$ data matrix $\mathbf{X}$.
The first principal component direction of the data is that along which the 
observations vary the most. For instance, consider Figure 6.14, which shows 
population size (pop) in tens of thousands of people, and ad spending for a 
particular company (ad) in thousands of dollars, for 100 cities. The green 
solid line represents the first principal component direction of the data. We 
can see by eye that this is the direction along which there is the greatest 
variability in the data. That is, if we projected the 100 observations onto 
this line (as shown in the left-hand panel of Figure 6.15), then the resulting 
projected observations would have the largest possible variance; projecting 
the observations onto any other line would yield projected observations 
with lower variance. Projecting a point onto a line simply involves finding 
the location on the line which is closest to the point. 

The first principal component is displayed graphically in Figure 6.14, but 
how can it be summarized mathematically? It is given by the formula 

$$Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}}). \tag{6.19}$$

Here $\phi_{11} = 0.839$ and $\phi_{21} = 0.544$ are the principal component loadings,
which define the direction referred to above. In (6.19), $\overline{\text{pop}}$ indicates the
mean of all pop values in this data set, and $\overline{\text{ad}}$ indicates the mean of all
advertising spending. The idea is that out of every possible linear combination
of pop and ad such that $\phi_{11}^2 + \phi_{21}^2 = 1$, this particular linear combination
yields the highest variance: i.e. this is the linear combination for which
$\text{Var}(\phi_{11} \times (\text{pop} - \overline{\text{pop}}) + \phi_{21} \times (\text{ad} - \overline{\text{ad}}))$ is maximized. It is necessary to
consider only linear combinations of the form $\phi_{11}^2 + \phi_{21}^2 = 1$, since otherwise
we could increase $\phi_{11}$ and $\phi_{21}$ arbitrarily in order to blow up the variance.
In (6.19), the two loadings are both positive and have similar size, and so
$Z_1$ is almost an average of the two variables.
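
For two predictors, the variance-maximizing unit-norm loadings can be computed in closed form from the 2 x 2 covariance matrix. The sketch below is one way to do this in plain Python; it is an illustration under stated assumptions, not the procedure used in the book. The `pop` and `ad` values are made-up numbers (not the advertising data), and the use of the closed-form eigenvector is a choice made here for brevity.

```python
import math

# Hypothetical (pop, ad) pairs; these are NOT the book's advertising data.
pop = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
ad = [6.0, 14.0, 17.0, 28.0, 31.0, 42.0]


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def first_loading(x1, x2):
    """Return unit-norm loadings and the score variance for centered x1, x2."""
    n = len(x1)
    a = sum(u * u for u in x1) / (n - 1)              # Var(x1)
    c = sum(v * v for v in x2) / (n - 1)              # Var(x2)
    b = sum(u * v for u, v in zip(x1, x2)) / (n - 1)  # Cov(x1, x2)
    # Largest eigenvalue of the covariance matrix [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b == 0:
        vec = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        vec = (b, lam - a)                            # unnormalized eigenvector
    norm = math.hypot(vec[0], vec[1])
    return (vec[0] / norm, vec[1] / norm), lam


def score_variance(x1, x2, phi):
    """Sample variance of phi[0] * x1 + phi[1] * x2 for centered inputs."""
    scores = [phi[0] * u + phi[1] * v for u, v in zip(x1, x2)]
    return sum(s * s for s in scores) / (len(scores) - 1)


pop_c, ad_c = center(pop), center(ad)
phi, lam = first_loading(pop_c, ad_c)
print("loadings:", phi)
print("variance along first component:", score_variance(pop_c, ad_c, phi))
print("variance along the pop axis only:", score_variance(pop_c, ad_c, (1.0, 0.0)))
```

The variance of the scores along the computed direction equals the largest eigenvalue and is at least as large as the variance along either coordinate axis, which is the maximization property described above.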

Since $n = 100$, pop and ad are vectors of length 100, and so is $Z_1$ in
(6.19). For instance,

$$z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}}). \tag{6.20}$$

The values of $z_{11}, \ldots, z_{n1}$ are known as the principal component scores, and
can be seen in the right-hand panel of Figure 6.15.

There is also another interpretation for PCA: the first principal component
vector defines the line that is as close as possible to the data. For
instance, in Figure 6.14, the first principal component line minimizes the 
sum of the squared perpendicular distances between each point and the 
line. These distances are plotted as dashed line segments in the left-hand 
panel of Figure 6.15, in which the crosses represent the projection of each 
point onto the first principal component line. The first principal component 
has been chosen so that the projected observations are as close as possible 
to the original observations. 
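
The two interpretations agree because of the Pythagorean theorem: for a centered observation, the squared distance from the origin equals the squared score along the line plus the squared perpendicular distance to the line. Since the total is fixed by the data, maximizing the variance of the scores is the same as minimizing the sum of squared perpendicular distances. A minimal Python sketch follows; the centered point is a made-up value, and the direction is renormalized because the quoted loadings are rounded.

```python
import math


def project_onto_line(point, direction):
    """Project a centered 2-D point onto the line through the origin along `direction`."""
    norm = math.hypot(direction[0], direction[1])
    u = (direction[0] / norm, direction[1] / norm)   # unit vector along the line
    score = point[0] * u[0] + point[1] * u[1]        # signed position along the line
    foot = (score * u[0], score * u[1])              # closest point on the line
    distance = math.hypot(point[0] - foot[0], point[1] - foot[1])
    return score, foot, distance


# Loadings from (6.19); the centered point is a made-up example.
direction = (0.839, 0.544)
point = (-12.0, -5.0)
score, foot, distance = project_onto_line(point, direction)
squared_length = point[0] ** 2 + point[1] ** 2
print(score, distance)
print(squared_length, score ** 2 + distance ** 2)  # equal up to rounding
```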

FIGURE 6.15. A subset of the advertising data. The mean pop and ad budgets
are indicated with a blue circle. Left: The first principal component direction is
shown in green. It is the dimension along which the data vary the most, and it also
defines the line that is closest to all n of the observations. The distances from each
observation to the principal component are represented using the black dashed line
segments. The blue dot represents $(\overline{\text{pop}}, \overline{\text{ad}})$. Right: The left-hand panel has been
rotated so that the first principal component direction coincides with the x-axis.


In the right-hand panel of Figure 6.15, the left-hand panel has been 
rotated so that the first principal component direction coincides with the 
x-axis. It is possible to show that the first principal component score for
the $i$th observation, given in (6.20), is the distance in the x-direction of the
$i$th cross from zero. So for example, the point in the bottom-left corner of
the left-hand panel of Figure 6.15 has a large negative principal component
score, $z_{i1} = -26.1$, while the point in the top-right corner has a large
positive score, $z_{i1} = 18.7$. These scores can be computed directly using
(6.20).

We can think of the values of the principal component $Z_1$ as single-number
summaries of the joint pop and ad budgets for each location. In
this example, if $z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}}) < 0$,
then this indicates a city with below-average population size and
below-average ad spending. A positive score suggests the opposite. How well can a
single number represent both pop and ad? In this case, Figure 6.14 indicates
that pop and ad have approximately a linear relationship, and so we might
expect that a single-number summary will work well. Figure 6.16 displays
$z_{i1}$ versus both pop and ad. The plots show a strong relationship between
the first principal component and the two features. In other words, the first
principal component appears to capture most of the information contained
in the pop and ad predictors.

So far we have concentrated on the first principal component. In general,
one can construct up to $p$ distinct principal components. The second
principal component $Z_2$ is a linear combination of the variables that is
uncorrelated with $Z_1$, and has largest variance subject to this constraint. The
second principal component direction is illustrated as a dashed blue line in
Figure 6.14. It turns out that the zero correlation condition of $Z_1$ with $Z_2$
is equivalent to the condition that the direction must be perpendicular, or
orthogonal, to the first principal component direction. The second principal
component is given by the formula

$$Z_2 = 0.544 \times (\text{pop} - \overline{\text{pop}}) - 0.839 \times (\text{ad} - \overline{\text{ad}}).$$

FIGURE 6.16. Plots of the first principal component scores $z_{i1}$ versus pop and
ad. The relationships are strong.

Since the advertising data has two predictors, the first two principal
components contain all of the information that is in pop and ad. However, by
construction, the first component will contain the most information.
Consider, for example, the much larger variability of $z_{i1}$ (the x-axis) versus
$z_{i2}$ (the y-axis) in the right-hand panel of Figure 6.15. The fact that the
second principal component scores are much closer to zero indicates that
this component captures far less information. As another illustration,
Figure 6.17 displays $z_{i2}$ versus pop and ad. There is little relationship between
the second principal component and these two predictors, again suggesting
that in this case, one only needs the first principal component in order to
accurately represent the pop and ad budgets.
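
As a quick check on the perpendicularity of the two directions, the short Python sketch below takes the loadings quoted in the formulas for the first and second principal components and computes their inner product and squared lengths. Only the quoted numbers are used; nothing here is fitted to data.

```python
# Loadings quoted in the text for the first and second principal components.
phi_1 = (0.839, 0.544)
phi_2 = (0.544, -0.839)

dot = phi_1[0] * phi_2[0] + phi_1[1] * phi_2[1]
norm_sq_1 = phi_1[0] ** 2 + phi_1[1] ** 2
norm_sq_2 = phi_2[0] ** 2 + phi_2[1] ** 2
print(dot)        # zero inner product: the two directions are perpendicular
print(norm_sq_1)  # close to 1; the loadings are rounded to three decimals
print(norm_sq_2)
```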

With two-dimensional data, such as in our advertising example, we can 
construct at most two principal components. However, if we had other 
predictors, such as population age, income level, education, and so forth, 
then additional components could be constructed. They would successively 
maximize variance, subject to the constraint of being uncorrelated with the 
preceding components. 

The Principal Components Regression Approach 

The principal components regression (PCR) approach involves constructing
the first $M$ principal components, $Z_1, \ldots, Z_M$, and then using these
components as the predictors in a linear regression model that is fit
using least squares. The key idea is that often a small number of principal
components suffice to explain most of the variability in the data, as
well as the relationship with the response. In other words, we assume that
the directions in which $X_1, \ldots, X_p$ show the most variation are the
directions that are associated with $Y$. While this assumption is not guaranteed
to be true, it often turns out to be a reasonable enough approximation to
give good results.

FIGURE 6.17. Plots of the second principal component scores $z_{i2}$ versus pop
and ad. The relationships are weak.

FIGURE 6.18. PCR was applied to two simulated data sets. Left: Simulated
data from Figure 6.8. Right: Simulated data from Figure 6.9.

If the assumption underlying PCR holds, then fitting a least squares
model to $Z_1, \ldots, Z_M$ will lead to better results than fitting a least squares
model to $X_1, \ldots, X_p$, since most or all of the information in the data that
relates to the response is contained in $Z_1, \ldots, Z_M$, and by estimating only
$M \ll p$ coefficients we can mitigate overfitting. In the advertising data, the
first principal component explains most of the variance in both pop and ad,
so a principal component regression that uses this single variable to predict
some response of interest, such as sales, will likely perform quite well.
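
The two steps of PCR with a single component can be sketched in a few lines of Python. This is an illustrative sketch only: the `pop`, `ad`, and `sales` values are invented, the predictors are centered but not standardized, and the closed-form 2 x 2 eigenvector is used in place of a general PCA routine.

```python
import math

# Made-up data for illustration only (not the book's advertising data).
pop = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
ad = [6.0, 14.0, 17.0, 28.0, 31.0, 42.0]
sales = [3.1, 5.2, 6.8, 9.9, 11.2, 14.1]


def mean(values):
    return sum(values) / len(values)


def first_loading(x1, x2):
    """Unit-norm first principal component loadings for centered x1, x2."""
    a = sum(u * u for u in x1)
    c = sum(v * v for v in x2)
    b = sum(u * v for u, v in zip(x1, x2))
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vec = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vec[0], vec[1])
    return vec[0] / norm, vec[1] / norm


# Step 1: build the first principal component scores from centered predictors.
pop_c = [v - mean(pop) for v in pop]
ad_c = [v - mean(ad) for v in ad]
phi_11, phi_21 = first_loading(pop_c, ad_c)
z1 = [phi_11 * u + phi_21 * v for u, v in zip(pop_c, ad_c)]

# Step 2: least squares regression of the response on z1 (M = 1).
theta_0 = mean(sales)  # intercept, because z1 has mean zero
theta_1 = sum(z * y for z, y in zip(z1, sales)) / sum(z * z for z in z1)

# Implied coefficients on the original (centered) predictors, as in (6.18).
beta_pop, beta_ad = theta_1 * phi_11, theta_1 * phi_21
fitted = [theta_0 + theta_1 * z for z in z1]
print("theta_1:", theta_1)
print("implied coefficients on pop and ad:", beta_pop, beta_ad)
```

Only one slope is estimated from the data, yet the fitted model still assigns a coefficient to each original predictor through the loadings.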

Figure 6.18 displays the PCR fits on the simulated data sets from
Figures 6.8 and 6.9. Recall that both data sets were generated using n = 50
observations and p = 45 predictors. However, while the response in the first
data set was a function of all the predictors, the response in the second data
set was generated using only two of the predictors. The curves are plotted
as a function of $M$, the number of principal components used as predictors
in the regression model. As more principal components are used in
the regression model, the bias decreases, but the variance increases. This
results in a typical U-shape for the mean squared error. When $M = p = 45$,
then PCR amounts simply to a least squares fit using all of the original
predictors. The figure indicates that performing PCR with an appropriate
choice of $M$ can result in a substantial improvement over least squares,
especially in the left-hand panel. However, by examining the ridge regression
and lasso results in Figures 6.5, 6.8, and 6.9, we see that PCR does not
perform as well as the two shrinkage methods in this example.

FIGURE 6.19. PCR, ridge regression, and the lasso were applied to a simulated
data set in which the first five principal components of $X$ contain all the
information about the response $Y$. In each panel, the irreducible error $\text{Var}(\epsilon)$ is shown as
a horizontal dashed line. Left: Results for PCR. Right: Results for lasso (solid)
and ridge regression (dotted). The x-axis displays the shrinkage factor of the
coefficient estimates, defined as the $\ell_2$ norm of the shrunken coefficient estimates
divided by the $\ell_2$ norm of the least squares estimate.

The relatively worse performance of PCR in Figure 6.18 is a consequence
of the fact that the data were generated in such a way that many principal
components are required in order to adequately model the response.
In contrast, PCR will tend to do well in cases when the first few principal
components are sufficient to capture most of the variation in the predictors
as well as the relationship with the response. The left-hand panel of
Figure 6.19 illustrates the results from another simulated data set designed to
be more favorable to PCR. Here the response was generated in such a way
that it depends exclusively on the first five principal components. Now the
bias drops to zero rapidly as $M$, the number of principal components used
in PCR, increases. The mean squared error displays a clear minimum at
$M = 5$. The right-hand panel of Figure 6.19 displays the results on these
data using ridge regression and the lasso. All three methods offer a
significant improvement over least squares. However, PCR and ridge regression
slightly outperform the lasso.

We note that even though PCR provides a simple way to perform
regression using $M < p$ predictors, it is not a feature selection method.
This is because each of the $M$ principal components used in the regression
is a linear combination of all $p$ of the original features. For instance, in
(6.19), $Z_1$ was a linear combination of both pop and ad. Therefore, while
PCR often performs quite well in many practical settings, it does not result
in the development of a model that relies upon a small set of the original
features. In this sense, PCR is more closely related to ridge regression than
to the lasso. In fact, one can show that PCR and ridge regression are very
closely related. One can even think of ridge regression as a continuous
version of PCR!⁴

FIGURE 6.20. Left: PCR standardized coefficient estimates on the Credit data
set for different values of $M$. Right: The ten-fold cross validation MSE obtained
using PCR, as a function of $M$.

In PCR, the number of principal components, M, is typically chosen by 
cross-validation. The results of applying PCR to the Credit data set are 
shown in Figure 6.20; the right-hand panel displays the cross-validation 
errors obtained, as a function of $M$. On these data, the lowest
cross-validation error occurs when there are $M = 10$ components; this
corresponds to almost no dimension reduction at all, since PCR with $M = 11$
is equivalent to simply performing least squares.

When performing PCR, we generally recommend standardizing each 
predictor, using (6.6), prior to generating the principal components. This 
standardization ensures that all variables are on the same scale. In the 
absence of standardization, the high-variance variables will tend to play a 
larger role in the principal components obtained, and the scale on which 
the variables are measured will ultimately have an effect on the final PCR 
model. However, if the variables are all measured in the same units (say, 
kilograms, or inches), then one might choose not to standardize them. 
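
A minimal sketch of the standardization step in Python is given below. It assumes that (6.6), which is not reproduced in this section, scales each predictor by its standard deviation computed with a 1/n divisor; the centering step is added here because principal components are computed from centered data. The two predictors are made-up values on deliberately different scales.

```python
import statistics


def standardize(column):
    """Center a predictor and divide by its standard deviation (1/n convention)."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(value - mean) / sd for value in column]


# Made-up predictors measured on very different scales.
income_dollars = [42000.0, 58000.0, 61000.0, 75000.0, 39000.0]
age_years = [25.0, 47.0, 52.0, 33.0, 61.0]

for name, column in [("income", income_dollars), ("age", age_years)]:
    scaled = standardize(column)
    # Standard deviation before and after scaling.
    print(name, statistics.pstdev(column), statistics.pstdev(scaled))
```

Before scaling, income has a far larger variance than age simply because of its units, so it would dominate the first principal component; after scaling, both predictors have unit standard deviation.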


⁴ More details can be found in Section 3.5 of Elements of Statistical Learning by
Hastie, Tibshirani, and Friedman.

FIGURE 6.21. For the advertising data, the first PLS direction (solid line) and
first PCR direction (dotted line) are shown.

6.3.2 Partial Least Squares 

The PCR approach that we just described involves identifying linear
combinations, or directions, that best represent the predictors $X_1, \ldots, X_p$. These
directions are identified in an unsupervised way, since the response Y is not 
used to help determine the principal component directions. That is, the 
response does not supervise the identification of the principal components. 
Consequently, PCR suffers from a drawback: there is no guarantee that the 
directions that best explain the predictors will also be the best directions 
to use for predicting the response. Unsupervised methods are discussed 
further in Chapter 10. 

We now present partial least squares (PLS), a supervised alternative to 
PCR. Like PCR, PLS is a dimension reduction method, which first identifies 
a new set of features $Z_1, \ldots, Z_M$ that are linear combinations of the original
features, and then fits a linear model via least squares using these M new 
features. But unlike PCR, PLS identifies these new features in a supervised 
way—that is, it makes use of the response Y in order to identify new 
features that not only approximate the old features well, but also that are 
related to the response. Roughly speaking, the PLS approach attempts to 
find directions that help explain both the response and the predictors. 

We now describe how the first PLS direction is computed. After
standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting
each $\phi_{j1}$ in (6.16) equal to the coefficient from the simple linear regression
of $Y$ onto $X_j$. One can show that this coefficient is proportional to the
correlation between $Y$ and $X_j$. Hence, in computing $Z_1 = \sum_{j=1}^{p} \phi_{j1} X_j$, PLS
places the highest weight on the variables that are most strongly related
to the response.
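
The computation of the first PLS direction can be sketched directly from this description. The Python code below is an illustration with made-up `pop`, `ad`, and `sales` values; it standardizes each predictor with a 1/n standard deviation (an assumption about the convention), sets each weight to the simple regression slope, and confirms that every weight is the same multiple of the corresponding correlation.

```python
import math
import statistics

# Made-up data for illustration only.
pop = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
ad = [6.0, 14.0, 17.0, 28.0, 31.0, 42.0]
sales = [3.1, 5.2, 6.8, 9.9, 11.2, 14.1]


def standardize(column):
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def simple_slope(x, y):
    """Slope from the simple linear regression of y onto x (with intercept)."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def correlation(x, y):
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


columns = [standardize(pop), standardize(ad)]
phi_1 = [simple_slope(col, sales) for col in columns]          # weights phi_j1
z1 = [sum(phi_1[j] * columns[j][i] for j in range(len(columns)))
      for i in range(len(sales))]                              # first PLS scores
ratios = [phi_1[j] / correlation(columns[j], sales) for j in range(len(columns))]
print("weights:", phi_1)
print("weight / correlation:", ratios)  # the same constant for every predictor
```

With standardized predictors the common ratio is the standard deviation of the response, so ranking predictors by weight is the same as ranking them by correlation with the response.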

Figure 6.21 displays an example of PLS on the advertising data. The solid
green line indicates the first PLS direction, while the dotted line shows the
first principal component direction. PLS has chosen a direction that has less
change in the ad dimension per unit change in the pop dimension, relative
to PCA. This suggests that pop is more highly correlated with the response
than is ad. The PLS direction does not fit the predictors as closely as does
PCA, but it does a better job explaining the response.

To identify the second PLS direction we first adjust each of the variables
for $Z_1$, by regressing each variable on $Z_1$ and taking residuals. These
residuals can be interpreted as the remaining information that has not been
explained by the first PLS direction. We then compute $Z_2$ using this
orthogonalized data in exactly the same fashion as $Z_1$ was computed based
on the original data. This iterative approach can be repeated $M$ times to
identify multiple PLS components $Z_1, \ldots, Z_M$. Finally, at the end of this
procedure, we use least squares to fit a linear model to predict $Y$ using
$Z_1, \ldots, Z_M$ in exactly the same fashion as for PCR.
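
The adjustment step for the second direction can be sketched as follows. This Python code is illustrative only: the three predictors and the response are made-up values, the helper names are hypothetical, and the predictors are standardized with a 1/n standard deviation. It computes the first direction, replaces each predictor by its residual after regression on the first scores, and repeats the same weight calculation on the residuals.

```python
import statistics

# Made-up data: three predictors and a response, for illustration only.
raw = [
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    [6.0, 14.0, 17.0, 28.0, 31.0, 42.0],
    [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
]
y = [3.1, 5.2, 6.8, 9.9, 11.2, 14.1]


def standardize(column):
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def slope_through_origin(x, y):
    """Least squares slope of y on x; equals the usual slope when x has mean zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pls_component(columns, y):
    """Weights phi_j and scores z_i = sum_j phi_j * x_ij for mean-zero columns."""
    phi = [slope_through_origin(col, y) for col in columns]
    z = [sum(phi[j] * columns[j][i] for j in range(len(columns)))
         for i in range(len(y))]
    return phi, z


def residualize(column, z):
    """Residuals from regressing a mean-zero column on the mean-zero scores z."""
    coef = slope_through_origin(z, column)
    return [c - coef * zi for c, zi in zip(column, z)]


columns = [standardize(col) for col in raw]
phi_1, z1 = pls_component(columns, y)                  # first PLS direction
adjusted = [residualize(col, z1) for col in columns]   # remove what z1 explains
phi_2, z2 = pls_component(adjusted, y)                 # second PLS direction
print("z1 . z2 =", sum(a * b for a, b in zip(z1, z2)))  # zero up to rounding
```

Because every adjusted predictor is orthogonal to the first scores, the second scores are as well, so the two components carry non-overlapping information.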

As with PCR, the number M of partial least squares directions used in 
PLS is a tuning parameter that is typically chosen by cross-validation. We 
generally standardize the predictors and response before performing PLS. 

PLS is popular in the field of chemometrics, where many variables arise 
from digitized spectrometry signals. In practice it often performs no better 
than ridge regression or PCR. While the supervised dimension reduction 
of PLS can reduce bias, it also has the potential to increase variance, so 
that the overall benefit of PLS relative to PCR is a wash. 


6.4 Considerations in High Dimensions 

6.4.1 High-Dimensional Data

Most traditional statistical techniques for regression and classification are
intended for the low-dimensional setting in which $n$, the number of
observations, is much greater than $p$, the number of features. This is due in
part to the fact that throughout most of the field’s history, the bulk of
scientific problems requiring the use of statistics have been low-dimensional.
For instance, consider the task of developing a model to predict a patient’s
blood pressure on the basis of his or her age, gender, and body mass index
(BMI). There are three predictors, or four if an intercept is included in
the model, and perhaps several thousand patients for whom blood pressure
and age, gender, and BMI are available. Hence $n \gg p$, and so the problem
is low-dimensional. (By dimension here we are referring to the size of $p$.)

In the past 20 years, new technologies have changed the way that data 
are collected in fields as diverse as finance, marketing, and medicine. It is 
now commonplace to collect an almost unlimited number of feature
measurements ($p$ very large). While $p$ can be extremely large, the number of
observations n is often limited due to cost, sample availability, or other 
considerations. Two examples are as follows: 

1. Rather than predicting blood pressure on the basis of just age, gender,
and BMI, one might also collect measurements for half a million
single nucleotide polymorphisms (SNPs; these are individual DNA
mutations that are relatively common in the population) for inclusion
in the predictive model. Then $n \approx 200$ and $p \approx 500{,}000$.

2. A marketing analyst interested in understanding people’s online shopping
patterns could treat as features all of the search terms entered
by users of a search engine. This is sometimes known as the “bag-of-words”
model. The same researcher might have access to the search
histories of only a few hundred or a few thousand search engine users
who have consented to share their information with the researcher.
For a given user, each of the $p$ search terms is scored present (1) or
absent (0), creating a large binary feature vector. Then $n \approx 1{,}000$
and $p$ is much larger.
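
The binary feature vector in the second example can be sketched in a few lines of Python. The vocabulary and the search history below are invented for illustration, and the sketch uses the common convention of 1 for a term that is present and 0 for a term that is absent.

```python
# Hypothetical vocabulary of p search terms and one user's search history.
vocabulary = ["shoes", "laptop", "flights", "coffee", "headphones"]
user_searches = ["cheap flights", "coffee maker", "flights to rome"]


def binary_features(searches, vocabulary):
    """Return a length-p vector with 1 if the term appears in any search, else 0."""
    words = {word for query in searches for word in query.lower().split()}
    return [1 if term in words else 0 for term in vocabulary]


features = binary_features(user_searches, vocabulary)
print(features)
```

In a realistic setting the vocabulary would contain many thousands of terms, so each user's vector would be long and mostly zeros.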

Data sets containing more features than observations are often referred 
to as high-dimensional. Classical approaches such as least squares linear 
regression are not appropriate in this setting. Many of the issues that arise 
in the analysis of high-dimensional data were discussed earlier in this book, 
since they apply also when n > p: these include the role of the bias-variance 
trade-off and the danger of overfitting. Though these issues are always
relevant, they can become particularly important when the number of features
is very large relative to the number of observations. 

We have defined the high-dimensional setting as the case where the number
of features $p$ is larger than the number of observations $n$. But the
considerations that we will now discuss certainly also apply if $p$ is slightly
smaller than $n$, and are best always kept in mind when performing
supervised learning.


6.4.2 What Goes Wrong in High Dimensions?

In order to illustrate the need for extra care and specialized techniques 
for regression and classification when p > n, we begin by examining what 
can go wrong if we apply a statistical technique not intended for the
high-dimensional setting. For this purpose, we examine least squares regression.
But the same concepts apply to logistic regression, linear discriminant
analysis, and other classical statistical approaches.

When the number of features p is as large as, or larger than, the number 
of observations n, least squares as described in Chapter 3 cannot (or rather, 
should not) be performed. The reason is simple: regardless of whether or 
not there truly is a relationship between the features and the response, 
least squares will yield a set of coefficient estimates that result in a perfect 
fit to the data, such that the residuals are zero. 

An example is shown in Figure 6.22 with $p = 1$ feature (plus an intercept)
in two cases: when there are 20 observations, and when there are only
two observations. When there are 20 observations, $n > p$ and the least
squares regression line does not perfectly fit the data; instead, the regression
line seeks to approximate the 20 observations as well as possible. On the
other hand, when there are only two observations, then regardless of the
values of those observations, the regression line will fit the data exactly.
This is problematic because this perfect fit will almost certainly lead to
overfitting of the data. In other words, though it is possible to perfectly fit
the training data in the high-dimensional setting, the resulting linear model
will perform extremely poorly on an independent test set, and therefore
does not constitute a useful model. In fact, we can see that this happened
in Figure 6.22: the least squares line obtained in the right-hand panel will
perform very poorly on a test set comprised of the observations in the
left-hand panel. The problem is simple: when $p > n$ or $p \approx n$, a simple least
squares regression line is too flexible and hence overfits the data.

FIGURE 6.22. Left: Least squares regression in the low-dimensional setting.
Right: Least squares regression with n = 2 observations and two parameters to be
estimated (an intercept and a coefficient).
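
The two-observation case is easy to reproduce. The Python sketch below fits a least squares line to six made-up observations and then to only the first two of them, and reports the training residual sum of squares in each case. The data values are invented and have no linear structure by design.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single feature."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def training_rss(x, y):
    """Residual sum of squares of the fitted line on the data used to fit it."""
    intercept, slope = fit_line(x, y)
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Made-up observations with no exact linear relationship.
x = [-1.2, -0.4, 0.1, 0.7, 1.5, 2.0]
y = [0.3, -2.1, 4.0, -0.5, 1.9, -3.2]

print(training_rss(x, y))          # six observations: residuals are not zero
print(training_rss(x[:2], y[:2]))  # two observations: the line passes through both
```

With two parameters and two observations the training residuals vanish (up to rounding) whatever the response values are, so a zero training error says nothing about how the line will perform on new data.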

Figure 6.23 further illustrates the risk of carelessly applying least squares
when the number of features $p$ is large. Data were simulated with $n = 20$
observations, and regression was performed with between 1 and 20 features,
each of which was completely unrelated to the response. As shown in the
figure, the model $R^2$ increases to 1 as the number of features included in the
model increases, and correspondingly the training set MSE decreases to 0
as the number of features increases, even though the features are completely
unrelated to the response. On the other hand, the MSE on an independent
test set becomes extremely large as the number of features included in the
model increases, because including the additional predictors leads to a vast
increase in the variance of the coefficient estimates. Looking at the test
set MSE, it is clear that the best model contains at most a few variables.
However, someone who carelessly examines only the $R^2$ or the training set
MSE might erroneously conclude that the model with the greatest number
of variables is best. This indicates the importance of applying extra care
when analyzing data sets with a large number of variables, and of always
evaluating model performance on an independent test set.

FIGURE 6.23. On a simulated example with n = 20 training observations,
features that are completely unrelated to the outcome are added to the model.
Left: The $R^2$ increases to 1 as more features are included. Center: The training
set MSE decreases to 0 as more features are included. Right: The test set MSE
increases as more features are included.
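
A simulation in the same spirit can be written with the Python standard library alone. The sketch below is not the book's simulation: the seed, the standard normal noise, and the use of Gram-Schmidt orthogonalization to obtain the least squares residuals are all choices made here for illustration. It adds pure-noise features one at a time to a model with an intercept and records the training $R^2$; with $n = 20$, the fit becomes exact once 19 features plus the intercept are included.

```python
import random

random.seed(1)
n = 20  # number of training observations

# Response and features are independent standard normal draws (pure noise).
y = [random.gauss(0.0, 1.0) for _ in range(n)]
features = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n - 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def training_r2_path(y, features):
    """Training R^2 after adding each feature in turn (intercept always included)."""
    n = len(y)
    y_bar = sum(y) / n
    residual = [v - y_bar for v in y]      # residual after fitting the intercept
    tss = dot(residual, residual)
    basis = [[1.0 / n ** 0.5] * n]         # orthonormal basis of the fitted space
    path = []
    for column in features:
        q = list(column)
        for b in basis:                    # Gram-Schmidt: remove what is already spanned
            coef = dot(q, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        norm = dot(q, q) ** 0.5
        if norm > 1e-12:                   # skip a feature that adds no new direction
            q = [qi / norm for qi in q]
            coef = dot(residual, q)
            residual = [ri - coef * qi for ri, qi in zip(residual, q)]
            basis.append(q)
        path.append(1.0 - dot(residual, residual) / tss)
    return path


r2 = training_r2_path(y, features)
for k in (1, 5, 10, 15, 19):
    print(k, round(r2[k - 1], 4))
```

The training $R^2$ never decreases as features are added and reaches 1 at the end, even though no feature has any true relationship with the response.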

In Section 6.1.3, we saw a number of approaches for adjusting the training
set RSS or $R^2$ in order to account for the number of variables used to fit
a least squares model. Unfortunately, the $C_p$, AIC, and BIC approaches
are not appropriate in the high-dimensional setting, because estimating $\hat{\sigma}^2$
is problematic. (For instance, the formula for $\hat{\sigma}^2$ from Chapter 3 yields an
estimate $\hat{\sigma}^2 = 0$ in this setting.) Similarly, problems arise in the application
of adjusted $R^2$ in the high-dimensional setting, since one can easily obtain
a model with an adjusted $R^2$ value of 1. Clearly, alternative approaches
that are better-suited to the high-dimensional setting are required.


6.4.3 Regression in High Dimensions

It turns out that many of the methods seen in this chapter for fitting 
less flexible least squares models, such as forward stepwise selection, ridge 
regression, the lasso, and principal components regression, are particularly 
useful for performing regression in the high-dimensional setting. Essentially, 
these approaches avoid overfitting by using a less flexible fitting approach 
than least squares. 

Figure 6.24 illustrates the performance of the lasso in a simple simulated
example. There are $p = 20$, 50, or 2,000 features, of which 20 are truly
associated with the outcome. The lasso was performed on $n = 100$ training
observations, and the mean squared error was evaluated on an independent
test set. As the number of features increases, the test set error increases.
When $p = 20$, the lowest validation set error was achieved when $\lambda$ in
(6.7) was small; however, when $p$ was larger, the lowest validation
set error was achieved using a larger value of $\lambda$. In each boxplot, rather
than reporting the values of $\lambda$ used, the degrees of freedom of the resulting
lasso solution are displayed; this is simply the number of non-zero coefficient
estimates in the lasso solution, and is a measure of the flexibility of the
lasso fit. Figure 6.24 highlights three important points: (1) regularization
or shrinkage plays a key role in high-dimensional problems, (2) appropriate
tuning parameter selection is crucial for good predictive performance, and
(3) the test error tends to increase as the dimensionality of the problem
(i.e. the number of features or predictors) increases, unless the additional
features are truly associated with the response.

FIGURE 6.24. The lasso was performed with n = 100 observations and three
values of $p$, the number of features. Of the $p$ features, 20 were associated with
the response. The boxplots show the test MSEs that result using three different
values of the tuning parameter $\lambda$ in (6.7). For ease of interpretation, rather than
reporting $\lambda$, the degrees of freedom are reported; for the lasso this turns out
to be simply the number of estimated non-zero coefficients. When $p = 20$, the
lowest test MSE was obtained with the smallest amount of regularization. When
$p = 50$, the lowest test MSE was achieved when there is a substantial amount
of regularization. When $p = 2{,}000$ the lasso performed poorly regardless of the
amount of regularization, due to the fact that only 20 of the 2,000 features truly
are associated with the outcome.

The third point above is in fact a key principle in the analysis of
high-dimensional data, which is known as the curse of dimensionality. One might
think that as the number of features used to fit a model increases, the 
quality of the fitted model will increase as well. However, comparing the 
left-hand and right-hand panels in Figure 6.24, we see that this is not 
necessarily the case: in this example, the test set MSE almost doubles as 
p increases from 20 to 2,000. In general, adding additional signal features 
that are truly associated with the response will improve the fitted model, 
in the sense of leading to a reduction in test set error. However, adding 
noise features that are not truly associated with the response will lead 
to a deterioration in the fitted model, and consequently an increased test 
set error. This is because noise features increase the dimensionality of the 
problem, exacerbating the risk of overfitting (since noise features may be 
assigned nonzero coefficients due to chance associations with the response 
on the training set) without any potential upside in terms of improved test 
set error. Thus, we see that new technologies that allow for the collection 
of measurements for thousands or millions of features are a double-edged 
sword: they can lead to improved predictive models if these features are in 
fact relevant to the problem at hand, but will lead to worse results if the 
features are not relevant. Even if they are relevant, the variance incurred 
in fitting their coefficients may outweigh the reduction in bias that they 
bring. 
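
The effect of noise features can be seen in a small simulation. The sketch below is not the setup behind Figure 6.24: it assumes 5 signal features, 60 pure-noise features, and ordinary least squares rather than the lasso, and all variable names are illustrative.

```r
# Simulated illustration: least squares with 5 signal features,
# with and without 60 added pure-noise features.
set.seed(1)
n_train = 100; n_test = 1000
p_signal = 5; p_noise = 60
n = n_train + n_test

x_signal = matrix(rnorm(n * p_signal), n, p_signal)
x_noise  = matrix(rnorm(n * p_noise),  n, p_noise)
y = drop(x_signal %*% rep(1, p_signal)) + rnorm(n)   # noise columns play no role

train = 1:n_train
test  = (n_train + 1):n

# Fit on the training rows, return the MSE on the test rows.
test_mse = function(x) {
  dat = data.frame(y = y, x)
  fit = lm(y ~ ., data = dat[train, ])
  mean((dat$y[test] - predict(fit, dat[test, ]))^2)
}

mse_signal_only = test_mse(x_signal)
mse_with_noise  = test_mse(cbind(x_signal, x_noise))
c(signal_only = mse_signal_only, with_noise = mse_with_noise)
```

The noise columns cannot reduce bias, so any coefficient they receive only adds variance to the test predictions.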

6.4.4 Interpreting Results in High Dimensions 

When we perform the lasso, ridge regression, or other regression procedures 
in the high-dimensional setting, we must be quite cautious in the way 
that we report the results obtained. In Chapter 3, we learned about 
multicollinearity, the concept that the variables in a regression might be 
correlated with each other. In the high-dimensional setting, the multicollinearity 
problem is extreme: any variable in the model can be written as a linear 
combination of all of the other variables in the model. Essentially, this 
means that we can never know exactly which variables (if any) truly are 
predictive of the outcome, and we can never identify the best coefficients 
for use in the regression. At most, we can hope to assign large regression 
coefficients to variables that are correlated with the variables that truly are 
predictive of the outcome. 
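
The extreme multicollinearity described above can be checked numerically. The following sketch assumes a simulated 10 × 15 matrix of independent Gaussian predictors (so p > n); it regresses the first column on the remaining ones and confirms that the residuals vanish.

```r
set.seed(1)
n = 10; p = 15
x = matrix(rnorm(n * p), n, p)     # more columns than rows

x_rank = qr(x)$rank                # the rank cannot exceed n

# Regress column 1 on the other 14 columns (no intercept). Because those
# columns already span all of n-dimensional space, the fit is exact.
fit = lm.fit(x[, -1], x[, 1])
max_abs_resid = max(abs(fit$residuals))
c(rank = x_rank, max_abs_resid = max_abs_resid)
```

Any column could play the role of column 1 here, which is why individual coefficients cannot be pinned down when p > n.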

For instance, suppose that we are trying to predict blood pressure on the 
basis of half a million SNPs, and that forward stepwise selection indicates 
that 17 of those SNPs lead to a good predictive model on the training data. 
It would be incorrect to conclude that these 17 SNPs predict blood pressure 
more effectively than the other SNPs not included in the model. There are 
likely to be many sets of 17 SNPs that would predict blood pressure just 
as well as the selected model. If we were to obtain an independent data set 
and perform forward stepwise selection on that data set, we would likely 
obtain a model containing a different, and perhaps even non-overlapping, 
set of SNPs. This does not detract from the value of the model obtained— 
for instance, the model might turn out to be very effective in predicting 
blood pressure on an independent set of patients, and might be clinically 
useful for physicians. But we must be careful not to overstate the results 
obtained, and to make it clear that what we have identified is simply one 
of many possible models for predicting blood pressure, and that it must be 
further validated on independent data sets. 

It is also important to be particularly careful in reporting errors and 
measures of model fit in the high-dimensional setting. We have seen that 
when p > n, it is easy to obtain a useless model that has zero residuals. 
Therefore, one should never use sum of squared errors, p-values, R² 
statistics, or other traditional measures of model fit on the training data as 
evidence of a good model fit in the high-dimensional setting. For instance, 
as we saw in Figure 6.23, one can easily obtain a model with R² = 1 when 
p > n. Reporting this fact might mislead others into thinking that a 
statistically valid and useful model has been obtained, whereas in fact this 
provides absolutely no evidence of a compelling model. It is important to 
instead report results on an independent test set, or cross-validation errors. 
For instance, the MSE or R² on an independent test set is a valid measure 
of model fit, but the MSE on the training set certainly is not. 
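
A minimal simulated sketch of this warning follows. It assumes 20 training observations and 19 predictors that are pure noise, so that the model (with its intercept) has as many parameters as observations; the helper r2() is defined here for illustration only.

```r
set.seed(1)
n = 20; p = 19                       # p predictors + intercept = n parameters
x = matrix(rnorm(2 * n * p), 2 * n, p)
y = rnorm(2 * n)                     # the response is unrelated to x
dat = data.frame(y = y, x)
train = 1:n; test = (n + 1):(2 * n)

fit = lm(y ~ ., data = dat[train, ])

# R-squared of predictions against observations.
r2 = function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

train_r2 = r2(dat$y[train], fitted(fit))
test_r2  = r2(dat$y[test], predict(fit, dat[test, ]))
c(train_r2 = train_r2, test_r2 = test_r2)
```

The training R² is essentially 1 even though no predictor carries any information, while the test R² reveals that the model is useless.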


6.5 Lab 1: Subset Selection Methods 

6.5.1 Best Subset Selection 

Here we apply the best subset selection approach to the Hitters data. We 
wish to predict a baseball player’s Salary on the basis of various statistics 
associated with performance in the previous year. 

First of all, we note that the Salary variable is missing for some of the 
players. The is.na() function can be used to identify the missing observations. 
It returns a vector of the same length as the input vector, with a TRUE 
for any elements that are missing, and a FALSE for non-missing elements. 
The sum() function can then be used to count all of the missing elements. 

> library(ISLR) 
> fix(Hitters) 
> names(Hitters) 
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI" 
 [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun" 
[11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
[16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague" 
> dim(Hitters) 
[1] 322  20 
> sum(is.na(Hitters$Salary)) 
[1] 59 

Hence we see that Salary is missing for 59 players. The na.omit() function 
removes all of the rows that have missing values in any variable. 

> Hitters=na.omit(Hitters) 
> dim(Hitters) 
[1] 263  20 
> sum(is.na(Hitters)) 
[1] 0 

The regsubsets() function (part of the leaps library) performs best subset 
selection by identifying the best model that contains a given number 
of predictors, where best is quantified using RSS. The syntax is the same 
as for lm(). The summary() command outputs the best set of variables for 
each model size. 


> library(leaps) 
> regfit.full=regsubsets(Salary~.,Hitters) 
> summary(regfit.full) 

Subset selection object 
Call: regsubsets.formula(Salary ~ ., Hitters) 
19 Variables  (and intercept) 
1 subsets of each size up to 8 
Selection Algorithm: exhaustive 
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits 
         CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts 
         Assists Errors NewLeagueN 


An asterisk indicates that a given variable is included in the corresponding 
model. For instance, this output indicates that the best two-variable model 
contains only Hits and CRBI. By default, regsubsets() only reports results 
up to the best eight-variable model. But the nvmax option can be used 
in order to return as many variables as are desired. Here we fit up to a 
19-variable model. 


> regfit.full=regsubsets(Salary~.,data=Hitters,nvmax=19) 

> reg.summary=summary(regfit.full) 
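
For intuition about what an exhaustive search of this kind does, the sketch below carries out best subset selection by brute force in base R. It is not how regsubsets() is implemented; the simulated data, the four predictors x1 to x4, and the object best_by_size are assumptions made for illustration.

```r
set.seed(1)
n = 50; p = 4
x = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y = 2 * x[, "x1"] - 3 * x[, "x3"] + rnorm(n)    # only x1 and x3 matter

# For each model size k, fit every subset of k predictors and keep
# the one with the smallest residual sum of squares.
best_by_size = lapply(1:p, function(k) {
  subsets = combn(p, k, simplify = FALSE)
  rss = sapply(subsets, function(s) {
    sum(resid(lm(y ~ x[, s, drop = FALSE]))^2)
  })
  list(vars = colnames(x)[subsets[[which.min(rss)]]], rss = min(rss))
})

sapply(best_by_size, function(b) paste(b$vars, collapse = " + "))
best_rss = sapply(best_by_size, function(b) b$rss)
best_rss
```

With p predictors there are 2^p candidate models, so this brute-force loop is only practical for small p.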


The summary() function also returns R², RSS, adjusted R², C_p, and BIC. 
We can examine these to try to select the best overall model. 

> names(reg.summary) 

[1] "which" "rsq" "rss" "adjr2" "cp" "bic" 

[7] "outmat" "obj" 




























































For instance, we see that the R² statistic increases from 32%, when only 
one variable is included in the model, to almost 55%, when all variables 
are included. As expected, the R² statistic increases monotonically as more 
variables are included. 

> reg.summary$rsq 

[1] 0.321 0.425 0.451 0.475 0.491 0.509 0.514 0.529 0.535 
[10] 0.540 0.543 0.544 0.544 0.545 0.545 0.546 0.546 0.546 
[19] 0.546 
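
The monotone behaviour of R², and the way adjusted R² penalizes extra variables, can be reproduced on simulated data. The sketch below assumes eight candidate predictors of which only the first two matter, and fits the nested models that use the first d columns; it uses the standard definition adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)].

```r
set.seed(1)
n = 60; p = 8
x = matrix(rnorm(n * p), n, p)
y = x[, 1] + 0.5 * x[, 2] + rnorm(n)     # only the first two columns matter
tss = sum((y - mean(y))^2)

# One row per model size d: R-squared and adjusted R-squared.
stats = t(sapply(1:p, function(d) {
  rss = sum(resid(lm(y ~ x[, 1:d, drop = FALSE]))^2)
  c(d = d,
    rsq = 1 - rss / tss,
    adjr2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1)))
}))
round(stats, 3)
which.max(stats[, "adjr2"])
```

R² can never decrease when a column is added, whereas adjusted R² can, which is what makes it usable for comparing models of different sizes.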

Plotting RSS, adjusted R², C_p, and BIC for all of the models at once will 
help us decide which model to select. Note the type="l" option tells R to 
connect the plotted points with lines. 

> par(mfrow=c(2,2)) 
> plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS", 
    type="l") 
> plot(reg.summary$adjr2,xlab="Number of Variables", 
    ylab="Adjusted RSq",type="l") 

The points() command works like the plot() command, except that it 
puts points on a plot that has already been created, instead of creating a 
new plot. The which.max() function can be used to identify the location of 
the maximum point of a vector. We will now plot a red dot to indicate the 
model with the largest adjusted R² statistic. 

> which.max(reg.summary$adjr2) 

[1] 11 

> points(11,reg.summary$adjr2[11],col="red",cex=2,pch=20) 

In a similar fashion we can plot the C_p and BIC statistics, and indicate the 
models with the smallest statistic using which.min(). 

> plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp", 
    type='l') 
> which.min(reg.summary$cp) 
[1] 10 
> points(10,reg.summary$cp[10],col="red",cex=2,pch=20) 
> which.min(reg.summary$bic) 
[1] 6 
> plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC", 
    type='l') 
> points(6,reg.summary$bic[6],col="red",cex=2,pch=20) 

The regsubsets() function has a built-in plot() command which can 
be used to display the selected variables for the best model with a given 
number of predictors, ranked according to the BIC, C_p, adjusted R², or 
AIC. To find out more about this function, type ?plot.regsubsets. 

> plot(regfit.full,scale="r2") 
> plot(regfit.full,scale="adjr2") 
> plot(regfit.full,scale="Cp") 
> plot(regfit.full,scale="bic") 


The top row of each plot contains a black square for each variable selected 
according to the optimal model associated with that statistic. For instance, 
we see that several models share a BIC close to −150. However, the model 
with the lowest BIC is the six-variable model that contains only AtBat, 
Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function 
to see the coefficient estimates associated with this model. 


> coef(regfit.full,6) 
(Intercept)       AtBat        Hits       Walks        CRBI 
     91.512      -1.869       7.604       3.698       0.643 
  DivisionW     PutOuts 
   -122.952       0.264 


6.5.2 Forward and Backward Stepwise Selection 

We can also use the regsubsets() function to perform forward stepwise 
or backward stepwise selection, using the argument method="forward" or 
method="backward". 

> regfit.fwd=regsubsets(Salary~.,data=Hitters,nvmax=19, 
    method="forward") 
> summary(regfit.fwd) 
> regfit.bwd=regsubsets(Salary~.,data=Hitters,nvmax=19, 
    method="backward") 
> summary(regfit.bwd) 

For instance, we see that using forward stepwise selection, the best 
one-variable model contains only CRBI, and the best two-variable model 
additionally includes Hits. For this data, the best one-variable through 
six-variable models are each identical for best subset and forward selection. 
However, the best seven-variable models identified by forward stepwise 
selection, backward stepwise selection, and best subset selection are different. 

> coef(regfit.full,7) 
(Intercept)        Hits       Walks      CAtBat       CHits 
     79.451       1.283       3.227      -0.375       1.496 
     CHmRun   DivisionW     PutOuts 
      1.442    -129.987       0.237 
> coef(regfit.fwd,7) 
(Intercept)       AtBat        Hits       Walks        CRBI 
    109.787      -1.959       7.450       4.913       0.854 
     CWalks   DivisionW     PutOuts 
     -0.305    -127.122       0.253 
> coef(regfit.bwd,7) 
(Intercept)       AtBat        Hits       Walks       CRuns 
    105.649      -1.976       6.757       6.056       1.129 
     CWalks   DivisionW     PutOuts 
     -0.716    -116.169       0.303 


6.5.3 Choosing Among Models Using the Validation Set 
Approach and Cross-Validation 

We just saw that it is possible to choose among a set of models of different 
sizes using C_p, BIC, and adjusted R². We will now consider how to do this 
using the validation set and cross-validation approaches. 

In order for these approaches to yield accurate estimates of the test 
error, we must use only the training observations to perform all aspects of 
model-fitting, including variable selection. Therefore, the determination of 
which model of a given size is best must be made using only the training 
observations. This point is subtle but important. If the full data set is used 
to perform the best subset selection step, the validation set errors and 
cross-validation errors that we obtain will not be accurate estimates of the 
test error. 
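
The size of this effect can be gauged with a simulation in which no feature is related to the response. The sketch below is an illustration under assumed settings (100 observations, 1,000 noise features, a simple correlation screen that keeps 10 features in place of best subset selection, and 20 repetitions); the function one_run() is hypothetical.

```r
set.seed(1)
n = 100; p = 1000; k = 10

one_run = function() {
  x = matrix(rnorm(n * p), n, p)
  y = rnorm(n)                          # no feature is truly related to y
  train = 1:(n / 2); valid = (n / 2 + 1):n

  # Select k features using the given rows, fit on the training rows,
  # and report the MSE on the validation rows.
  val_mse = function(select_rows) {
    score = abs(cor(x[select_rows, ], y[select_rows]))
    keep = order(score, decreasing = TRUE)[1:k]
    fit = lm.fit(cbind(1, x[train, keep]), y[train])
    pred = cbind(1, x[valid, keep]) %*% fit$coefficients
    mean((y[valid] - pred)^2)
  }
  c(selected_on_all_data = val_mse(1:n),
    selected_on_training = val_mse(train))
}

avg_mse = rowMeans(replicate(20, one_run()))
avg_mse
```

When the screen is allowed to see the validation rows, the validation error looks far better than it should, even though every selected feature is noise.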

In order to use the validation set approach, we begin by splitting the 
observations into a training set and a test set. We do this by creating 
a random vector, train, of elements equal to TRUE if the corresponding 
observation is in the training set, and FALSE otherwise. The vector test has 
a TRUE if the observation is in the test set, and a FALSE otherwise. Note the 
! in the command to create test causes TRUEs to be switched to FALSEs and 
vice versa. We also set a random seed so that the user will obtain the same 
training set/test set split. 

> set.seed (1) 

> train=sample(c(TRUE,FALSE), nrow(Hitters),rep=TRUE) 

> test=(!train) 

Now, we apply regsubsets() to the training set in order to perform best 
subset selection. 

> regfit.best=regsubsets(Salary~.,data=Hitters[train,], 
    nvmax=19) 

Notice that we subset the Hitters data frame directly in the call in order 
to access only the training subset of the data, using the expression 
Hitters[train,]. We now compute the validation set error for the best 
model of each model size. We first make a model matrix from the test 
data. 

> test.mat=model.matrix(Salary~.,data=Hitters[test,]) 

The model.matrix() function is used in many regression packages for building 
an “X” matrix from data. Now we run a loop, and for each size i, we 
extract the coefficients from regfit.best for the best model of that size, 
multiply them into the appropriate columns of the test model matrix to 
form the predictions, and compute the test MSE. 

> val.errors=rep(NA,19) 
> for(i in 1:19){ 
+   coefi=coef(regfit.best,id=i) 
+   pred=test.mat[,names(coefi)]%*%coefi 
+   val.errors[i]=mean((Hitters$Salary[test]-pred)^2) 
+ } 

We find that the best model is the one that contains ten variables. 

> val.errors 

[1] 220968 169157 178518 163426 168418 171271 162377 157909 
[9] 154056 148162 151156 151742 152214 157359 158541 158743 
[17] 159973 159860 160106 

> which.min(val.errors) 

[1] 10 

> coef(regfit.best,10) 
(Intercept)       AtBat        Hits       Walks      CAtBat 
    -80.275      -1.468       7.163       3.643      -0.186 
      CHits      CHmRun      CWalks     LeagueN   DivisionW 
      1.105       1.384      -0.748      84.558     -53.029 
    PutOuts 
      0.238 


This was a little tedious, partly because there is no predict() method 
for regsubsets(). Since we will be using this function again, we can capture 
our steps above and write our own predict method. 

> predict.regsubsets=function(object,newdata,id,...){ 
+   form=as.formula(object$call[[2]]) 
+   mat=model.matrix(form,newdata) 
+   coefi=coef(object,id=id) 
+   xvars=names(coefi) 
+   mat[,xvars]%*%coefi 
+ } 


Our function pretty much mimics what we did above. The only complex 
part is how we extracted the formula used in the call to regsubsets(). We 
demonstrate how we use this function below, when we do cross-validation. 
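
The reason a function named predict.regsubsets is picked up by predict() is R's S3 dispatch: the generic looks for a method whose suffix matches the class of its first argument. The toy sketch below illustrates the mechanism with a made-up class "toyfit"; neither the class nor the object toy_fit has anything to do with leaps.

```r
# A toy "fitted model" that stores only an intercept and a slope.
toy_fit = structure(list(intercept = 1, slope = 2), class = "toyfit")

# predict() is an S3 generic, so defining predict.toyfit is enough for
# predict(toy_fit, ...) to dispatch to this function.
predict.toyfit = function(object, newdata, ...) {
  object$intercept + object$slope * newdata$x
}

toy_pred = predict(toy_fit, data.frame(x = c(0, 1, 2)))
toy_pred   # 1 3 5
```

The same naming convention is what lets the call predict(best.fit, ...) used later in this lab reach our own method.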

Finally, we perform best subset selection on the full data set, and select 
the best ten-variable model. It is important that we make use of the full 
data set in order to obtain more accurate coefficient estimates. Note that 
we perform best subset selection on the full data set and select the best ten- 
variable model, rather than simply using the variables that were obtained 
from the training set, because the best ten-variable model on the full data 
set may differ from the corresponding model on the training set. 


> regfit.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 
> coef(regfit.best,10) 
(Intercept)       AtBat        Hits       Walks      CAtBat 
    162.535      -2.169       6.918       5.773      -0.130 
      CRuns        CRBI      CWalks   DivisionW     PutOuts 
      1.408       0.774      -0.831    -112.380       0.297 
    Assists 
      0.283 

In fact, we see that the best ten-variable model on the full data set has a 
different set of variables than the best ten-variable model on the training 
set. 

We now try to choose among the models of different sizes using cross- 
validation. This approach is somewhat involved, as we must perform best 
subset selection within each of the k training sets. Despite this, we see that 
with its clever subsetting syntax, R makes this job quite easy. First, we 
create a vector that allocates each observation to one of k = 10 folds, and 
we create a matrix in which we will store the results. 

> k=10 
> set.seed(1) 
> folds=sample(1:k,nrow(Hitters),replace=TRUE) 
> cv.errors=matrix(NA,k,19,dimnames=list(NULL,paste(1:19))) 

Now we write a for loop that performs cross-validation. In the jth fold, the 
elements of folds that equal j are in the test set, and the remainder are in 
the training set. We make our predictions for each model size (using our 
new predict() method), compute the test errors on the appropriate subset, 
and store them in the appropriate slot in the matrix cv.errors. 

> for(j in 1:k){ 
+   best.fit=regsubsets(Salary~.,data=Hitters[folds!=j,], 
+       nvmax=19) 
+   for(i in 1:19){ 
+     pred=predict(best.fit,Hitters[folds==j,],id=i) 
+     cv.errors[j,i]=mean((Hitters$Salary[folds==j]-pred)^2) 
+   } 
+ } 

This has given us a 10×19 matrix, of which the (i, j)th element corresponds 
to the test MSE for the ith cross-validation fold for the best j-variable 
model. We use the apply() function to average over the columns of this 
matrix in order to obtain a vector for which the jth element is the 
cross-validation error for the j-variable model. 
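
The same fold-by-model bookkeeping can be tried without the leaps package. The sketch below is an illustration on simulated data in which the candidate models are polynomials of degree 1 to 4 fit with lm(), not best-subset models; it also assumes balanced fold labels, whereas the lab draws them with replacement.

```r
set.seed(1)
n = 120; k = 5; max_degree = 4
x = runif(n, -2, 2)
y = 1 + x - 0.5 * x^2 + rnorm(n, sd = 0.5)   # the true curve is quadratic
dat = data.frame(x = x, y = y)

folds = sample(rep(1:k, length.out = n))     # balanced random fold labels
cv.errors = matrix(NA, k, max_degree,
                   dimnames = list(NULL, paste(1:max_degree)))

for (j in 1:k) {
  for (i in 1:max_degree) {
    fit = lm(y ~ poly(x, i), data = dat[folds != j, ])   # train without fold j
    pred = predict(fit, dat[folds == j, ])               # predict fold j
    cv.errors[j, i] = mean((dat$y[folds == j] - pred)^2)
  }
}

mean.cv.errors = apply(cv.errors, 2, mean)   # average each column over folds
mean.cv.errors
which.min(mean.cv.errors)
```

Rows index folds and columns index candidate models, exactly as in the cv.errors matrix of the lab.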

> mean.cv.errors=apply(cv.errors ,2,mean) 

> mean.cv.errors 

[1] 160093 140197 153117 151159 146841 138303 144346 130208 
[9] 129460 125335 125154 128274 133461 133975 131826 131883 
[17] 132751 133096 132805 

> par(mfrow=c(1,1)) 
> plot(mean.cv.errors,type='b') 

We see that cross-validation selects an 11-variable model. We now perform 
best subset selection on the full data set in order to obtain the 11-variable 
model. 

> reg.best=regsubsets(Salary~.,data=Hitters,nvmax=19) 
> coef(reg.best,11) 
(Intercept)       AtBat        Hits       Walks      CAtBat 
    135.751      -2.128       6.924       5.620      -0.139 
      CRuns        CRBI      CWalks     LeagueN   DivisionW 
      1.455       0.785      -0.823      43.112    -111.146 
    PutOuts     Assists 
      0.289       0.269 


6.6 Lab 2: Ridge Regression and the Lasso 

We will use the glmnet package in order to perform ridge regression and 
the lasso. The main function in this package is glmnet(), which can be used 
to fit ridge regression models, lasso models, and more. This function has 
slightly different syntax from other model-fitting functions that we have 
encountered thus far in this book. In particular, we must pass in an x 
matrix as well as a y vector, and we do not use the y ~ x syntax. We will 
now perform ridge regression and the lasso in order to predict Salary on 
the Hitters data. Before proceeding ensure that the missing values have 
been removed from the data, as described in Section 6.5. 

> x=model.matrix(Salary~.,Hitters)[,-1] 

> y=Hitters$Salary 

The model.matrix() function is particularly useful for creating x; not only 
does it produce a matrix corresponding to the 19 predictors but it also 
automatically transforms any qualitative variables into dummy variables. 
The latter property is important because glmnet() can only take numerical, 
quantitative inputs. 
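
A tiny example makes the dummy-variable conversion visible. The data frame below is made up for illustration (four rows, one numeric predictor, one two-level factor) and only borrows column names from Hitters.

```r
dat = data.frame(
  Salary = c(100, 200, 150, 300),
  Hits   = c(50, 80, 60, 120),
  League = factor(c("A", "N", "N", "A"))
)

# Build the design matrix and drop its first column, the intercept.
x = model.matrix(Salary ~ ., dat)[, -1]
colnames(x)   # "Hits" "LeagueN"
x
```

The factor League has become a single 0/1 column, LeagueN, with level "A" serving as the baseline.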


6.6.1 Ridge Regression 

The glmnet() function has an alpha argument that determines what type 
of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 
then a lasso model is fit. We first fit a ridge regression model. 

> library(glmnet) 

> grid=10^seq(10,-2,length=100) 

> ridge.mod=glmnet(x,y,alpha=0,lambda=grid) 

By default the glmnet() function performs ridge regression for an automatically 
selected range of λ values. However, here we have chosen to implement 
the function over a grid of values ranging from λ = 10¹⁰ to λ = 10⁻², 
essentially covering the full range of scenarios from the null model containing 
only the intercept, to the least squares fit. As we will see, we can also 
compute model fits for a particular value of λ that is not one of the original 
grid values. Note that by default, the glmnet() function standardizes the 
variables so that they are on the same scale. To turn off this default setting, 
use the argument standardize=FALSE. 

Associated with each value of λ is a vector of ridge regression coefficients, 
stored in a matrix that can be accessed by coef(). In this case, it is a 20×100 
matrix, with 20 rows (one for each predictor, plus an intercept) and 100 
columns (one for each value of λ). 

> dim(coef(ridge.mod)) 
[1]  20 100 

We expect the coefficient estimates to be much smaller, in terms of ℓ2 norm, 
when a large value of λ is used, as compared to when a small value of λ is 
used. These are the coefficients when λ = 11,498, along with their ℓ2 norm: 


> ridge.mod$lambda[50] 
[1] 11498 
> coef(ridge.mod)[,50] 
(Intercept)       AtBat        Hits       HmRun        Runs 
    407.356       0.037       0.138       0.525       0.231 
        RBI       Walks       Years      CAtBat       CHits 
      0.240       0.290       1.108       0.003       0.012 
     CHmRun       CRuns        CRBI      CWalks     LeagueN 
      0.088       0.023       0.024       0.025       0.085 
  DivisionW     PutOuts     Assists      Errors  NewLeagueN 
     -6.215       0.016       0.003      -0.021       0.301 
> sqrt(sum(coef(ridge.mod)[-1,50]^2)) 
[1] 6.36 

In contrast, here are the coefficients when λ = 705, along with their ℓ2 
norm. Note the much larger ℓ2 norm of the coefficients associated with this 
smaller value of λ. 


> ridge.mod$lambda[60] 
[1] 705 
> coef(ridge.mod)[,60] 
(Intercept)       AtBat        Hits       HmRun        Runs 
     54.325       0.112       0.656       1.180       0.938 
        RBI       Walks       Years      CAtBat       CHits 
      0.847       1.320       2.596       0.011       0.047 
     CHmRun       CRuns        CRBI      CWalks     LeagueN 
      0.338       0.094       0.098       0.072      13.684 
  DivisionW     PutOuts     Assists      Errors  NewLeagueN 
    -54.659       0.119       0.016      -0.704       8.612 
> sqrt(sum(coef(ridge.mod)[-1,60]^2)) 
[1] 57.1 
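
The growth of the ℓ2 norm as λ shrinks can also be reproduced from the closed-form ridge solution. The sketch below is an illustration on simulated, standardized predictors and assumes the textbook objective RSS + λ Σ β_j²; glmnet() scales its objective differently, so its λ values are not directly comparable with the ones used here. The helper ridge_coef() is hypothetical.

```r
set.seed(1)
n = 50; p = 5
x = scale(matrix(rnorm(n * p), n, p))        # standardized predictors
y = drop(x %*% c(3, -2, 0, 0, 1)) + rnorm(n)
yc = y - mean(y)                             # centering y removes the intercept

# Closed-form ridge estimate: solve (X'X + lambda * I) beta = X'y.
ridge_coef = function(lambda) {
  drop(solve(crossprod(x) + lambda * diag(p), crossprod(x, yc)))
}

grid = 10^seq(4, -2, length.out = 7)         # from heavy to light shrinkage
l2_norm = sapply(grid, function(l) sqrt(sum(ridge_coef(l)^2)))
round(cbind(lambda = grid, l2_norm = l2_norm), 3)

# With lambda = 0 the estimate is ordinary least squares.
ols = unname(coef(lm(yc ~ x - 1)))
all.equal(ridge_coef(0), ols)
```

As λ decreases along the grid the norm rises steadily toward that of the least squares coefficients.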

We can use the predict() function for a number of purposes. For instance, 
we can obtain the ridge regression coefficients for a new value of λ, say 50: 


> predict(ridge.mod,s=50,type="coefficients")[1:20,] 
(Intercept)       AtBat        Hits       HmRun        Runs 
     48.766      -0.358       1.969      -1.278       1.146 
        RBI       Walks       Years      CAtBat       CHits 
      0.804       2.716      -6.218       0.005       0.106 
     CHmRun       CRuns        CRBI      CWalks     LeagueN 
      0.624       0.221       0.219      -0.150      45.926 
  DivisionW     PutOuts     Assists      Errors  NewLeagueN 
   -118.201       0.250       0.122      -3.279      -9.497 

We now split the samples into a training set and a test set in order 
to estimate the test error of ridge regression and the lasso. There are two 
common ways to randomly split a data set. The first is to produce a random 
vector of TRUE, FALSE elements and select the observations corresponding to 
TRUE for the training data. The second is to randomly choose a subset of 
numbers between 1 and n; these can then be used as the indices for the 
training observations. The two approaches work equally well. We used the 
former method in Section 6.5.3. Here we demonstrate the latter approach. 
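
The two splitting approaches can be compared side by side on a toy vector. The sketch below assumes n = 10 observations; the names train_lgl, train_idx and the response y are illustrative.

```r
set.seed(1)
n = 10
y = seq(10, 100, by = 10)

# Approach 1: a logical vector; the size of the training set is random.
train_lgl = sample(c(TRUE, FALSE), n, replace = TRUE)
test_lgl  = !train_lgl

# Approach 2: a vector of row indices; exactly n/2 training observations.
train_idx = sample(1:n, n / 2)
test_idx  = -train_idx           # negative indices drop the training rows

y_train = y[train_idx]
y_test  = y[test_idx]
c(n_train = length(y_train), n_test = length(y_test))
```

With index vectors the complement is taken with a minus sign, whereas with logical vectors it is taken with the ! operator.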

We first set a random seed so that the results obtained will be 
reproducible. 

> set.seed(1) 
> train=sample(1:nrow(x),nrow(x)/2) 
> test=(-train) 
> y.test=y[test] 

Next we fit a ridge regression model on the training set, and evaluate 
its MSE on the test set, using λ = 4. Note the use of the predict() 
function again. This time we get predictions for a test set, by replacing 
type="coefficients" with the newx argument. 

> ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, 
    thresh=1e-12) 
> ridge.pred=predict(ridge.mod,s=4,newx=x[test,]) 
> mean((ridge.pred-y.test)^2) 
[1] 101037 

The test MSE is 101037. Note that if we had instead simply fit a model 
with just an intercept, we would have predicted each test observation using 
the mean of the training observations. In that case, we could compute the 
test set MSE like this: 

> mean((mean(y[train])-y.test)^2) 

[1] 193253 

We could also get the same result by fitting a ridge regression model with 
a very large value of λ. Note that 1e10 means 10¹⁰. 

> ridge.pred=predict(ridge.mod,s=1e10,newx=x[test,]) 
> mean((ridge.pred-y.test)^2) 
[1] 193253 

So fitting a ridge regression model with λ = 4 leads to a much lower test 
MSE than fitting a model with just an intercept. We now check whether 
there is any benefit to performing ridge regression with λ = 4 instead of 
just performing least squares regression. Recall that least squares is simply 
ridge regression with λ = 0.⁵ 

⁵ In order for glmnet() to yield the exact least squares coefficients when λ = 0, 
we use the argument exact=T when calling the predict() function. Otherwise, the 
predict() function will interpolate over the grid of λ values used in fitting the 
glmnet() model, yielding approximate results. When we use exact=T, there remains 
a slight discrepancy in the third decimal place between the output of glmnet() when 
λ = 0 and the output of lm(); this is due to numerical approximation on the part of 
glmnet(). 

> ridge.pred=predict(ridge.mod,s=0,newx=x[test,],exact=T) 
> mean((ridge.pred-y.test)^2) 
[1] 114783 
> lm(y~x, subset=train) 
> predict(ridge.mod,s=0,exact=T,type="coefficients")[1:20,] 

In general, if we want to fit a (unpenalized) least squares model, then 
we should use the lm() function, since that function provides more useful 
outputs, such as standard errors and p-values for the coefficients. 

In general, instead of arbitrarily choosing λ = 4, it would be better to 
use cross-validation to choose the tuning parameter λ. We can do this using 
the built-in cross-validation function, cv.glmnet(). By default, the function 
performs ten-fold cross-validation, though this can be changed using the 
argument nfolds. Note that we set a random seed first so our results will 
be reproducible, since the choice of the cross-validation folds is random. 

> set.seed(1) 
> cv.out=cv.glmnet(x[train,],y[train],alpha=0) 
> plot(cv.out) 
> bestlam=cv.out$lambda.min 
> bestlam 
[1] 212 

Therefore, we see that the value of λ that results in the smallest 
cross-validation error is 212. What is the test MSE associated with this value of 
λ? 

> ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,]) 
> mean((ridge.pred-y.test)^2) 
[1] 96016 

This represents a further improvement over the test MSE that we got using 
λ = 4. Finally, we refit our ridge regression model on the full data set, 
using the value of λ chosen by cross-validation, and examine the coefficient 
estimates. 

> out=glmnet(x,y,alpha=0) 
> predict(out,type="coefficients",s=bestlam)[1:20,] 
(Intercept)       AtBat        Hits       HmRun        Runs 
     9.8849      0.0314      1.0088      0.1393      1.1132 
        RBI       Walks       Years      CAtBat       CHits 
     0.8732      1.8041      0.1307      0.0111      0.0649 
     CHmRun       CRuns        CRBI      CWalks     LeagueN 
     0.4516      0.1290      0.1374      0.0291     27.1823 
  DivisionW     PutOuts     Assists      Errors  NewLeagueN 
   -91.6341      0.1915      0.0425     -1.8124      7.2121 

As expected, none of the coefficients are zero: ridge regression does not 
perform variable selection! 


6.6.2 The Lasso 

We saw that ridge regression with a wise choice of λ can outperform least 
squares as well as the null model on the Hitters data set. We now ask 
whether the lasso can yield either a more accurate or a more interpretable 
model than ridge regression. In order to fit a lasso model, we once again 
use the glmnet() function; however, this time we use the argument alpha=1. 
Other than that change, we proceed just as we did in fitting a ridge model. 

> lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid) 
> plot(lasso.mod) 

We can see from the coefficient plot that depending on the choice of tuning 
parameter, some of the coefficients will be exactly equal to zero. We now 
perform cross-validation and compute the associated test error. 

> set.seed(1) 
> cv.out=cv.glmnet(x[train,],y[train],alpha=1) 
> plot(cv.out) 
> bestlam=cv.out$lambda.min 
> lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,]) 
> mean((lasso.pred-y.test)^2) 
[1] 100743 

This is substantially lower than the test set MSE of the null model and of 
least squares, and very similar to the test MSE of ridge regression with λ 
chosen by cross-validation. 

However, the lasso has a substantial advantage over ridge regression in 
that the resulting coefficient estimates are sparse. Here we see that 12 of 
the 19 coefficient estimates are exactly zero. So the lasso model with λ 
chosen by cross-validation contains only seven variables. 
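
Why the lasso produces exact zeros while ridge regression does not is easiest to see in a simple special case. The sketch below assumes n = p, an identity design matrix and no intercept, so that ridge minimizes Σ(y_j − β_j)² + λ Σ β_j² and the lasso minimizes Σ(y_j − β_j)² + λ Σ |β_j|; the values of y and λ are made up.

```r
y = c(-3, -0.4, 0.2, 0.9, 2.5)
lambda = 1

# Ridge: every coefficient is shrunk by the same proportion.
ridge_est = y / (1 + lambda)

# Lasso: soft-thresholding, i.e. move each value toward zero by lambda/2
# and set it to exactly zero if it would cross zero.
lasso_est = sign(y) * pmax(abs(y) - lambda / 2, 0)

rbind(y = y, ridge = ridge_est, lasso = lasso_est)
n_zero = sum(lasso_est == 0)
n_zero   # 2
```

The two values of y smaller than λ/2 in absolute value are set exactly to zero by the lasso, whereas ridge only rescales them.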

> out=glmnet(x,y,alpha=1,lambda=grid) 
> lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,] 
> lasso.coef 
(Intercept)       AtBat        Hits       HmRun        Runs 
     18.539       0.000       1.874       0.000       0.000 
        RBI       Walks       Years      CAtBat       CHits 
      0.000       2.218       0.000       0.000       0.000 
     CHmRun       CRuns        CRBI      CWalks     LeagueN 
      0.000       0.207       0.413       0.000       3.267 
  DivisionW     PutOuts     Assists      Errors  NewLeagueN 
   -103.485       0.220       0.000       0.000       0.000 
> lasso.coef[lasso.coef!=0] 
(Intercept)        Hits       Walks       CRuns        CRBI 
     18.539       1.874       2.218       0.207       0.413 
    LeagueN   DivisionW     PutOuts 
      3.267    -103.485       0.220 


6.7 Lab 3: PCR and PLS Regression 

6.7.1 Principal Components Regression 

Principal components regression (PCR) can be performed using the pcr() 
function, which is part of the pls library. We now apply PCR to the Hitters 
data, in order to predict Salary. Again, ensure that the missing values have 
been removed from the data, as described in Section 6.5. 

> library(pls) 
> set.seed(2) 
> pcr.fit=pcr(Salary~., data=Hitters,scale=TRUE, 
    validation="CV") 

The syntax for the pcr() function is similar to that for lm(), with a few 
additional options. Setting scale=TRUE has the effect of standardizing each 
predictor, using (6.6), prior to generating the principal components, so that 
the scale on which each variable is measured will not have an effect. Setting 
validation="CV" causes pcr() to compute the ten-fold cross-validation error 
for each possible value of M, the number of principal components used. The 
resulting fit can be examined using summary(). 

> summary(pcr.fit) 
Data:   X dimension: 263 19 
        Y dimension: 263 1 
Fit method: svdpc 
Number of components considered: 19 

VALIDATION: RMSEP 
Cross-validated using 10 random segments. 
       (Intercept)  1 comps  2 comps  3 comps  4 comps 
CV             452    348.9    352.2    353.5    352.8 
adjCV          452    348.7    351.8    352.9    352.1 

TRAINING: % variance explained 
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps 
X         38.31    60.16    70.84    79.03    84.29    88.63 
Salary    40.63    41.58    42.17    43.22    44.90    46.48 


The CV score is provided for each possible number of components, ranging
from M = 0 onwards. (We have printed the CV output only up to M = 4.)
Note that pcr() reports the root mean squared error; in order to obtain
the usual MSE, we must square this quantity. For instance, a root mean
squared error of 352.8 corresponds to an MSE of $352.8^2 = 124{,}468$.

One can also plot the cross-validation scores using the validationplot()
function. Using val.type="MSEP" will cause the cross-validation MSE to be
plotted.




> validationplot(pcr.fit,val.type="MSEP") 

We see that the smallest cross-validation error occurs when M = 16
components are used. This is barely fewer than M = 19, which amounts to
simply performing least squares, because when all of the components are
used in PCR no dimension reduction occurs. However, from the plot we
also see that the cross-validation error is roughly the same when only one
component is included in the model. This suggests that a model that uses
just a small number of components might suffice.
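
The cross-validation carried out by pcr() can be approximated by hand. What follows is a minimal sketch in base R, not the pls implementation: it assumes the built-in mtcars data as a stand-in for Hitters (so that the code is self-contained), and the helper name pcr_cv_mse is hypothetical. Within each fold the standardization and the principal components are recomputed on the training part only.

```r
# Hand-rolled 10-fold CV for principal components regression (base R only).
# Assumptions: mtcars stands in for Hitters; pcr_cv_mse is a hypothetical helper.
pcr_cv_mse <- function(X, y, M, folds) {
  sq_err <- numeric(length(y))
  for (k in sort(unique(folds))) {
    tr <- folds != k
    # Standardize and rotate using the training fold only
    pc <- prcomp(X[tr, , drop = FALSE], center = TRUE, scale. = TRUE)
    Z_tr <- pc$x[, 1:M, drop = FALSE]
    Z_te <- predict(pc, X[!tr, , drop = FALSE])[, 1:M, drop = FALSE]
    # Least squares of y on the first M component scores
    fit <- lm.fit(cbind(1, Z_tr), y[tr])
    pred <- drop(cbind(1, Z_te) %*% fit$coefficients)
    sq_err[!tr] <- (y[!tr] - pred)^2
  }
  mean(sq_err)
}

set.seed(2)
X <- as.matrix(mtcars[, -1])   # 10 predictors
y <- mtcars$mpg
folds <- sample(rep(1:10, length.out = nrow(X)))
cv_mse <- sapply(1:ncol(X), function(M) pcr_cv_mse(X, y, M, folds))
best_M <- which.min(cv_mse)    # number of components with the smallest CV error
cv_rmsep <- sqrt(cv_mse)       # same scale as the RMSEP printed by summary()
```

Plotting cv_mse against the number of components gives a curve analogous to the one drawn with val.type="MSEP".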

The summary() function also provides the percentage of variance explained
in the predictors and in the response using different numbers of
components. This concept is discussed in greater detail in Chapter 10. Briefly,
we can think of this as the amount of information about the predictors or
the response that is captured using M principal components. For example,
setting M = 1 only captures 38.31% of all the variance, or information, in
the predictors. In contrast, using M = 6 increases the value to 88.63%. If
we were to use all M = p = 19 components, this would increase to 100%.
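
The two rows of the "% variance explained" table can be reproduced approximately with base R. This is a sketch under the assumption that the built-in mtcars data stands in for Hitters; the names pve_X and pve_y are illustrative only.

```r
# Cumulative percentage of predictor variance captured by the first M components.
# Assumption: built-in mtcars data stands in for the Hitters predictors.
X <- as.matrix(mtcars[, -1])
pc <- prcomp(X, center = TRUE, scale. = TRUE)
pve_X <- 100 * cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Percentage of response variance explained: R^2 from regressing y on the
# first M component scores.
y <- mtcars$mpg
pve_y <- sapply(seq_len(ncol(X)), function(M) {
  100 * summary(lm(y ~ pc$x[, 1:M, drop = FALSE]))$r.squared
})
round(rbind(X = pve_X, y = pve_y), 2)
```

With all components included, the predictor row reaches 100% and the response row equals the R^2 of ordinary least squares on the original predictors.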

We now perform PCR on the training data and evaluate its test set 
performance. 

> set.seed(1)
> pcr.fit=pcr(Salary~., data=Hitters, subset=train, scale=TRUE,
    validation="CV")
> validationplot(pcr.fit,val.type="MSEP")

Now we find that the lowest cross-validation error occurs when M = 7
components are used. We compute the test MSE as follows.

> pcr.pred=predict(pcr.fit,x[test,],ncomp=7)
> mean((pcr.pred-y.test)^2)
[1] 96556

This test set MSE is competitive with the results obtained using ridge
regression and the lasso. However, as a result of the way PCR is implemented,
the final model is more difficult to interpret because it does not perform
any kind of variable selection or even directly produce coefficient estimates.
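
Although PCR is fit on the components, the implied coefficients on the standardized predictors can be recovered by multiplying the loadings by the component coefficients. The sketch below assumes the built-in mtcars data and M = 3, both chosen purely for illustration; it shows that every predictor receives a (generally non-zero) coefficient, which is why no variable selection takes place.

```r
# Recovering predictor-scale coefficients from a PCR fit (illustrative sketch).
# Assumptions: mtcars stands in for Hitters; M = 3 is an arbitrary choice.
X <- scale(as.matrix(mtcars[, -1]))          # standardized predictors
y <- mtcars$mpg
pc <- prcomp(X, center = FALSE, scale. = FALSE)
M <- 3
gamma <- coef(lm(y ~ pc$x[, 1:M]))[-1]       # coefficients on the M components
beta_std <- pc$rotation[, 1:M] %*% gamma     # implied coefficients on standardized X

# Both parameterizations give the same fitted values
pred_components <- mean(y) + pc$x[, 1:M] %*% gamma
pred_predictors <- mean(y) + X %*% beta_std
```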

Finally, we fit PCR on the full data set, using M = 7, the number of 
components identified by cross-validation. 

> pcr.fit=pcr(y~x,scale=TRUE,ncomp=7)
> summary(pcr.fit)
Data:   X dimension: 263 19
        Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
X    38.31    60.16    70.84    79.03    84.29    88.63
y    40.63    41.58    42.17    43.22    44.90    46.48
   7 comps
X    92.26
y    46.69

6.7.2 Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also
in the pls library. The syntax is just like that of the pcr() function.

> set.seed(1)
> pls.fit=plsr(Salary~., data=Hitters, subset=train, scale=TRUE,
    validation="CV")
> summary(pls.fit)
Data:   X dimension: 131 19
        Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           464.6    394.2    391.5    393.1    395.0
adjCV        464.6    393.4    390.2    391.1    392.9
...
TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
X         38.12    53.46    66.05    74.49    79.33    84.56
Salary    33.58    38.96    41.57    42.43    44.04    45.59
...


> validationplot(pls.fit,val.type="MSEP") 

The lowest cross-validation error occurs when only M = 2 partial least
squares directions are used. We now evaluate the corresponding test set
MSE.

> pls.pred=predict(pls.fit,x[test,],ncomp=2)
> mean((pls.pred-y.test)^2)
[1] 101417

The test MSE is comparable to, but slightly higher than, the test MSE 
obtained using ridge regression, the lasso, and PCR. 

Finally, we perform PLS using the full data set, using M = 2, the number 
of components identified by cross-validation. 

> pls.fit=plsr(Salary~., data=Hitters, scale=TRUE, ncomp=2)
> summary(pls.fit)
Data:   X dimension: 263 19
        Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
        1 comps  2 comps
X         38.08    51.03
Salary    43.05    46.40

Notice that the percentage of variance in Salary that the two-component
PLS fit explains, 46.40%, is almost as much as that explained using the
final seven-component model PCR fit, 46.69%. This is because PCR only
attempts to maximize the amount of variance explained in the predictors,
while PLS searches for directions that explain variance in both the
predictors and the response.
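
The contrast between the two methods can be seen in their first directions. The following is a sketch, assuming the built-in mtcars data in place of Hitters: the first PLS weight vector is taken proportional to the inner product of each standardized predictor with the response, while the first principal component ignores the response entirely.

```r
# First PLS direction versus first principal component (illustrative sketch).
# Assumption: mtcars stands in for Hitters.
X <- scale(as.matrix(mtcars[, -1]))
y <- mtcars$mpg - mean(mtcars$mpg)

# First PLS direction: weights proportional to <x_j, y> for standardized x_j
phi_pls <- drop(crossprod(X, y))
phi_pls <- phi_pls / sqrt(sum(phi_pls^2))
z_pls <- drop(X %*% phi_pls)

# First principal component direction: computed from X alone
phi_pc <- prcomp(X, center = FALSE)$rotation[, 1]
z_pc <- drop(X %*% phi_pc)

c(cov_pls = abs(cov(z_pls, y)), cov_pc = abs(cov(z_pc, y)))  # PLS has the larger covariance with y
c(var_pls = var(z_pls), var_pc = var(z_pc))                  # PC1 has the larger variance
```

Among unit-length weight vectors, the PLS direction maximizes the covariance with the response and the first principal component maximizes the variance of the scores, which matches the trade-off described above.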


6.8 Exercises 

Conceptual 

1. We perform best subset, forward stepwise, and backward stepwise 
selection on a single data set. For each approach, we obtain p + 1 
models, containing 0, 1, 2, ..., p predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest 
training RSS? 

(b) Which of the three models with k predictors has the smallest 
test RSS? 

(c) True or False: 

i. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k+1)-variable
model identified by forward stepwise selection.

ii. The predictors in the k-variable model identified by backward
stepwise are a subset of the predictors in the (k+1)-variable
model identified by backward stepwise selection.

iii. The predictors in the k-variable model identified by backward
stepwise are a subset of the predictors in the (k+1)-variable
model identified by forward stepwise selection.

iv. The predictors in the k-variable model identified by forward
stepwise are a subset of the predictors in the (k+1)-variable
model identified by backward stepwise selection.

v. The predictors in the k-variable model identified by best
subset are a subset of the predictors in the (k+1)-variable
model identified by best subset selection.

2. For parts (a) through (c), indicate which of i. through iv. is correct. 
Justify your answer. 

(a) The lasso, relative to least squares, is: 

i. More flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.

ii. More flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease
in bias.

iii. Less flexible and hence will give improved prediction
accuracy when its increase in bias is less than its decrease in
variance.

iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease
in bias.

(b) Repeat (a) for ridge regression relative to least squares. 

(c) Repeat (a) for non-linear methods relative to least squares. 

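3. Suppose we estimate the regression coefficients in a linear regression
model by minimizing

$$\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s$$

for a particular value of s. For parts (a) through (e), indicate which
of i. through v. is correct. Justify your answer.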

(a) As we increase s from 0, the training RSS will: 

i. Increase initially, and then eventually start decreasing in an 
inverted U shape. 

ii. Decrease initially, and then eventually start increasing in a 
U shape. 

iii. Steadily increase. 

iv. Steadily decrease. 

v. Remain constant. 

(b) Repeat (a) for test RSS. 

(c) Repeat (a) for variance. 

(d) Repeat (a) for (squared) bias. 

(e) Repeat (a) for the irreducible error. 

4. Suppose we estimate the regression coefficients in a linear regression
model by minimizing

$$\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

for a particular value of $\lambda$. For parts (a) through (e), indicate which
of i. through v. is correct. Justify your answer.

(a) As we increase $\lambda$ from 0, the training RSS will:

i. Increase initially, and then eventually start decreasing in an 
inverted U shape. 

ii. Decrease initially, and then eventually start increasing in a 
U shape. 

iii. Steadily increase. 

iv. Steadily decrease. 

v. Remain constant. 

(b) Repeat (a) for test RSS. 

(c) Repeat (a) for variance. 

(d) Repeat (a) for (squared) bias. 

(e) Repeat (a) for the irreducible error. 

5. It is well-known that ridge regression tends to give similar coefficient
values to correlated variables, whereas the lasso may give quite
different coefficient values to correlated variables. We will now explore
this property in a very simple setting.

Suppose that $n = 2$, $p = 2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Furthermore,
suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, so that
the estimate for the intercept in a least squares, ridge regression, or
lasso model is zero: $\hat\beta_0 = 0$.

(a) Write out the ridge regression optimization problem in this
setting.

(b) Argue that in this setting, the ridge coefficient estimates satisfy
$\hat\beta_1 = \hat\beta_2$.

(c) Write out the lasso optimization problem in this setting.

(d) Argue that in this setting, the lasso coefficients $\hat\beta_1$ and $\hat\beta_2$ are
not unique; in other words, there are many possible solutions
to the optimization problem in (c). Describe these solutions.

6. We will now explore (6.12) and (6.13) further. 

(a) Consider (6.12) with p = 1. For some choice of $y_1$ and $\lambda > 0$,
plot (6.12) as a function of $\beta_1$. Your plot should confirm that
(6.12) is solved by (6.14).

(b) Consider (6.13) with p = 1. For some choice of $y_1$ and $\lambda > 0$,
plot (6.13) as a function of $\beta_1$. Your plot should confirm that
(6.13) is solved by (6.15).
7. We will now derive the Bayesian connection to the lasso and ridge
regression discussed in Section 6.2.2.

(a) Suppose that $y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i$ where $\epsilon_1, \ldots, \epsilon_n$ are
independent and identically distributed from a $N(0, \sigma^2)$ distribution.
Write out the likelihood for the data.

(b) Assume the following prior for $\beta$: $\beta_1, \ldots, \beta_p$ are independent
and identically distributed according to a double-exponential
distribution with mean 0 and common scale parameter b: i.e.
$p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)$. Write out the posterior for $\beta$ in this
setting.

(c) Argue that the lasso estimate is the mode for $\beta$ under this
posterior distribution.

(d) Now assume the following prior for $\beta$: $\beta_1, \ldots, \beta_p$ are independent
and identically distributed according to a normal distribution
with mean zero and variance c. Write out the posterior for $\beta$ in
this setting.

(e) Argue that the ridge regression estimate is both the mode and
the mean for $\beta$ under this posterior distribution.


Applied 

8. In this exercise, we will generate simulated data, and will then use 
this data to perform best subset selection. 

(a) Use the rnorm() function to generate a predictor X of length
n = 100, as well as a noise vector $\epsilon$ of length n = 100.

(b) Generate a response vector Y of length n = 100 according to
the model

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon,$$

where $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are constants of your choice.

(c) Use the regsubsets() function to perform best subset selection
in order to choose the best model containing the predictors
$X, X^2, \ldots, X^{10}$. What is the best model obtained according to
$C_p$, BIC, and adjusted $R^2$? Show some plots to provide evidence
for your answer, and report the coefficients of the best model
obtained. Note you will need to use the data.frame() function to
create a single data set containing both X and Y.

(d) Repeat (c), using forward stepwise selection and also using
backwards stepwise selection. How does your answer compare to the
results in (c)?

(e) Now fit a lasso model to the simulated data, again using $X, X^2,
\ldots, X^{10}$ as predictors. Use cross-validation to select the optimal
value of $\lambda$. Create plots of the cross-validation error as a function
of $\lambda$. Report the resulting coefficient estimates, and discuss the
results obtained.

(f) Now generate a response vector Y according to the model

$$Y = \beta_0 + \beta_7 X^7 + \epsilon,$$

and perform best subset selection and the lasso. Discuss the
results obtained.

9. In this exercise, we will predict the number of applications received 
using the other variables in the College data set. 

(a) Split the data set into a training set and a test set. 

(b) Fit a linear model using least squares on the training set, and 
report the test error obtained. 

(c) Fit a ridge regression model on the training set, with $\lambda$ chosen
by cross-validation. Report the test error obtained.

(d) Fit a lasso model on the training set, with $\lambda$ chosen by
cross-validation. Report the test error obtained, along with the
number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value
of M selected by cross-validation.

(f) Fit a PLS model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value
of M selected by cross-validation.

(g) Comment on the results obtained. How accurately can we
predict the number of college applications received? Is there much
difference among the test errors resulting from these five
approaches?

10. We have seen that as the number of features used in a model increases, 
the training error will necessarily decrease, but the test error may not. 
We will now explore this in a simulated data set. 

(a) Generate a data set with p = 20 features, n = 1,000
observations, and an associated quantitative response vector generated
according to the model

$$Y = X\beta + \epsilon,$$

where $\beta$ has some elements that are exactly equal to zero.

(b) Split your data set into a training set containing 100 observations
and a test set containing 900 observations.

(c) Perform best subset selection on the training set, and plot the 
training set MSE associated with the best model of each size. 

(d) Plot the test set MSE associated with the best model of each 
size. 

(e) For which model size does the test set MSE take on its minimum 
value? Comment on your results. If it takes on its minimum value 
for a model containing only an intercept or a model containing 
all of the features, then play around with the way that you are 
generating the data in (a) until you come up with a scenario in 
which the test set MSE is minimized for an intermediate model 
size. 

(f) How does the model at which the test set MSE is minimized 
compare to the true model used to generate the data? Comment 
on the coefficient values. 

(g) Create a plot displaying $\sqrt{\sum_{j=1}^{p} (\beta_j - \hat\beta_j^r)^2}$ for a range of values
of r, where $\hat\beta_j^r$ is the jth coefficient estimate for the best model
containing r coefficients. Comment on what you observe. How
does this compare to the test MSE plot from (d)?

11. We will now try to predict per capita crime rate in the Boston data 
set. 

(a) Try out some of the regression methods explored in this chapter, 
such as best subset selection, the lasso, ridge regression, and 
PCR. Present and discuss results for the approaches that you 
consider. 

(b) Propose a model (or set of models) that seem to perform well on
this data set, and justify your answer. Make sure that you are
evaluating model performance using validation set error,
cross-validation, or some other reasonable alternative, as opposed to
using training error.

(c) Does your chosen model involve all of the features in the data 
set? Why or why not? 



7 Moving Beyond Linearity


So far in this book, we have mostly focused on linear models. Linear models
are relatively simple to describe and implement, and have advantages over
other approaches in terms of interpretation and inference. However,
standard linear regression can have significant limitations in terms of
predictive power. This is because the linearity assumption is almost always an
approximation, and sometimes a poor one. In Chapter 6 we see that we can
improve upon least squares using ridge regression, the lasso, principal
components regression, and other techniques. In that setting, the improvement
is obtained by reducing the complexity of the linear model, and hence the
variance of the estimates. But we are still using a linear model, which can
only be improved so far! In this chapter we relax the linearity assumption
while still attempting to maintain as much interpretability as possible. We
do this by examining very simple extensions of linear models like
polynomial regression and step functions, as well as more sophisticated approaches
such as splines, local regression, and generalized additive models.

• Polynomial regression extends the linear model by adding extra
predictors, obtained by raising each of the original predictors to a power.
For example, a cubic regression uses three variables, $X$, $X^2$, and $X^3$,
as predictors. This approach provides a simple way to provide a
non-linear fit to data.

• Step functions cut the range of a variable into K distinct regions in
order to produce a qualitative variable. This has the effect of fitting
a piecewise constant function.

G. James et al., An Introduction to Statistical Learning: with Applications in R, 265 
Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7—7, 

© Springer Science+Business Media New York 2013 
• Regression splines are more flexible than polynomials and step
functions, and in fact are an extension of the two. They involve
dividing the range of X into K distinct regions. Within each region,
a polynomial function is fit to the data. However, these polynomials
are constrained so that they join smoothly at the region boundaries,
or knots. Provided that the interval is divided into enough regions,
this can produce an extremely flexible fit.

• Smoothing splines are similar to regression splines, but arise in a 
slightly different situation. Smoothing splines result from minimizing 
a residual sum of squares criterion subject to a smoothness penalty. 

• Local regression is similar to splines, but differs in an important way. 
The regions are allowed to overlap, and indeed they do so in a very 
smooth way. 

• Generalized additive models allow us to extend the methods above to 
deal with multiple predictors. 

In Sections 7.1-7.6, we present a number of approaches for modeling the
relationship between a response Y and a single predictor X in a flexible
way. In Section 7.7, we show that these approaches can be seamlessly
integrated in order to model a response Y as a function of several predictors
$X_1, \ldots, X_p$.

7.1 Polynomial Regression 

Historically, the standard way to extend linear regression to settings in
which the relationship between the predictors and the response is
non-linear has been to replace the standard linear model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

with a polynomial function

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \epsilon_i, \tag{7.1}$$

where $\epsilon_i$ is the error term. This approach is known as polynomial regression,
and in fact we saw an example of this method in Section 3.3.2. For large
enough degree d, a polynomial regression allows us to produce an extremely
non-linear curve. Notice that the coefficients in (7.1) can be easily estimated
using least squares linear regression because this is just a standard linear
model with predictors $x_i, x_i^2, x_i^3, \ldots, x_i^d$. Generally speaking, it is unusual
to use d greater than 3 or 4 because for large values of d, the polynomial
curve can become overly flexible and can take on some very strange shapes.
This is especially true near the boundary of the X variable.
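
Because (7.1) is an ordinary linear model in the powers of x, it can be fit with lm(). The sketch below uses simulated data (the variables x and y and the generating curve are assumptions made only for illustration, not the Wage data) and fits a degree-4 polynomial in two equivalent ways.

```r
# Degree-4 polynomial regression by least squares (simulated data, illustrative only).
set.seed(1)
x <- runif(200, 18, 80)                              # simulated "age"
y <- 60 + 2 * x - 0.02 * x^2 + rnorm(200, sd = 10)   # simulated "wage"

fit_raw  <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))     # explicit powers, as in (7.1)
fit_poly <- lm(y ~ poly(x, 4))                       # orthogonal polynomial basis

# The coefficients differ between the two bases, but the fitted curve does not.
max(abs(fitted(fit_raw) - fitted(fit_poly)))
```

The orthogonal basis produced by poly() changes the individual coefficient estimates but not the fitted values, which is one reason the fitted function, rather than the coefficients, is usually examined.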


[Figure 7.1; panel title: Degree-4 Polynomial]

FIGURE 7.1. The Wage data. Left: The solid blue curve is a degree-4 polynomial
of wage (in thousands of dollars) as a function of age, fit by least squares. The
dotted curves indicate an estimated 95% confidence interval. Right: We model the
binary event wage>250 using logistic regression, again with a degree-4 polynomial.
The fitted posterior probability of wage exceeding $250,000 is shown in blue, along
with an estimated 95% confidence interval.


The left-hand panel in Figure 7.1 is a plot of wage against age for the 
Wage data set, which contains income and demographic information for 
males who reside in the central Atlantic region of the United States. We 
see the results of fitting a degree-4 polynomial using least squares (solid 
blue curve). Even though this is a linear regression model like any other, 
the individual coefficients are not of particular interest. Instead, we look at 
the entire fitted function across a grid of 62 values for age from 18 to 80 in 
order to understand the relationship between age and wage. 

In Figure 7.1, a pair of dotted curves accompanies the fit; these are (2×)
standard error curves. Let's see how these arise. Suppose we have computed
the fit at a particular value of age, $x_0$:

$$\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2 + \hat\beta_3 x_0^3 + \hat\beta_4 x_0^4. \tag{7.2}$$

What is the variance of the fit, i.e. $\mathrm{Var}\,\hat f(x_0)$? Least squares returns variance
estimates for each of the fitted coefficients $\hat\beta_j$, as well as the covariances
between pairs of coefficient estimates. We can use these to compute the
estimated variance of $\hat f(x_0)$.¹ The estimated pointwise standard error of
$\hat f(x_0)$ is the square-root of this variance. This computation is repeated
at each reference point $x_0$, and we plot the fitted curve, as well as twice
the standard error on either side of the fitted curve. We plot twice the
standard error because, for normally distributed error terms, this quantity
corresponds to an approximate 95% confidence interval.

¹ If $\hat C$ is the 5 × 5 covariance matrix of the $\hat\beta_j$, and if $\ell_0^T = (1, x_0, x_0^2, x_0^3, x_0^4)$, then
$\mathrm{Var}[\hat f(x_0)] = \ell_0^T \hat C \ell_0$.
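
The quadratic form in the footnote can be checked numerically. This is a sketch on simulated data (an assumption; it is not the Wage data), using the orthogonal basis from poly() in place of raw powers, so each row of L plays the role of the vector in the footnote expressed in that basis.

```r
# Pointwise standard errors of a polynomial fit from the coefficient covariance matrix.
# Simulated data; the names L, C and se_manual are illustrative.
set.seed(1)
x <- runif(200, 18, 80)
y <- 60 + 2 * x - 0.02 * x^2 + rnorm(200, sd = 10)
fit <- lm(y ~ poly(x, 4))

x0 <- seq(18, 80, by = 1)                    # grid of reference points
L  <- cbind(1, predict(poly(x, 4), x0))      # one basis vector per reference point
C  <- vcov(fit)                              # 5 x 5 covariance matrix of the estimates
se_manual <- sqrt(rowSums((L %*% C) * L))    # sqrt of the quadratic form at each x0

pred  <- predict(fit, newdata = data.frame(x = x0), se.fit = TRUE)
bands <- cbind(lower = pred$fit - 2 * se_manual,
               upper = pred$fit + 2 * se_manual)
```

The hand-computed se_manual agrees with the se.fit component returned by predict(), and bands holds the curves drawn at twice the standard error on either side of the fit.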

It seems like the wages in Figure 7.1 are from two distinct populations: 
there appears to be a high earners group earning more than $250,000 per 
annum, as well as a low earners group. We can treat wage as a binary 
variable by splitting it into these two groups. Logistic regression can then 
be used to predict this binary response, using polynomial functions of age 
as predictors. In other words, we fit the model 


$$\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}. \tag{7.3}$$


The result is shown in the right-hand panel of Figure 7.1. The gray marks 
on the top and bottom of the panel indicate the ages of the high earners 
and the low earners. The solid blue curve indicates the fitted probabilities 
of being a high earner, as a function of age. The estimated 95% confidence
interval is shown as well. We see that here the confidence intervals are fairly 
wide, especially on the right-hand side. Although the sample size for this 
data set is substantial (n = 3,000), there are only 79 high earners, which 
results in a high variance in the estimated coefficients and consequently 
wide confidence intervals. 
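
A fit of the form (7.3) can be sketched with glm(). The data below are simulated with a rare positive class (the generating probabilities are an assumption for illustration only). One common approach, used here, is to form the interval on the logit scale and then transform it, which keeps the bounds between 0 and 1.

```r
# Polynomial logistic regression with approximate 95% bands (simulated data).
set.seed(1)
age  <- round(runif(3000, 18, 80))
p    <- plogis(-6 + 0.15 * age - 0.002 * age^2)   # assumed true probabilities
high <- rbinom(3000, 1, p)                         # 1 = "high earner"

fit  <- glm(high ~ poly(age, 4), family = binomial)
grid <- data.frame(age = 18:80)
pr   <- predict(fit, newdata = grid, se.fit = TRUE)   # predictions on the logit scale

prob  <- plogis(pr$fit)                      # fitted probabilities
lower <- plogis(pr$fit - 2 * pr$se.fit)      # lower band, transformed to probabilities
upper <- plogis(pr$fit + 2 * pr$se.fit)      # upper band, transformed to probabilities
```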


7.2 Step Functions 

Using polynomial functions of the features as predictors in a linear model 
imposes a global structure on the non-linear function of X. We can instead 
use step functions in order to avoid imposing such a global structure. Here 
we break the range of X into bins, and fit a different constant in each bin. 
This amounts to converting a continuous variable into an ordered categorical 
variable. 

In greater detail, we create cutpoints $c_1, c_2, \ldots, c_K$ in the range of X,
and then construct K + 1 new variables

$$\begin{aligned}
C_0(X) &= I(X < c_1), \\
C_1(X) &= I(c_1 \le X < c_2), \\
C_2(X) &= I(c_2 \le X < c_3), \\
&\;\;\vdots \\
C_{K-1}(X) &= I(c_{K-1} \le X < c_K), \\
C_K(X) &= I(c_K \le X),
\end{aligned} \tag{7.4}$$

where $I(\cdot)$ is an indicator function that returns a 1 if the condition is true,
and returns a 0 otherwise. For example, $I(c_K \le X)$ equals 1 if $c_K \le X$, and
equals 0 otherwise. These are sometimes called dummy variables. Notice
that for any value of X, $C_0(X) + C_1(X) + \cdots + C_K(X) = 1$, since X must
be in exactly one of the K + 1 intervals. We then use least squares to fit a
linear model using $C_1(X), C_2(X), \ldots, C_K(X)$ as predictors²:

$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i. \tag{7.5}$$

For a given value of X, at most one of $C_1, C_2, \ldots, C_K$ can be non-zero.
Note that when $X < c_1$, all of the predictors in (7.5) are zero, so $\beta_0$ can
be interpreted as the mean value of Y for $X < c_1$. By comparison, (7.5)
predicts a response of $\beta_0 + \beta_j$ for $c_j \le X < c_{j+1}$, so $\beta_j$ represents the average
increase in the response for X in $c_j \le X < c_{j+1}$ relative to $X < c_1$.

[Figure 7.2; panel title: Piecewise Constant]

FIGURE 7.2. The Wage data. Left: The solid curve displays the fitted value from
a least squares regression of wage (in thousands of dollars) using step functions
of age. The dotted curves indicate an estimated 95% confidence interval. Right:
We model the binary event wage>250 using logistic regression, again using step
functions of age. The fitted posterior probability of wage exceeding $250,000 is
shown, along with an estimated 95% confidence interval.

An example of fitting step functions to the Wage data from Figure 7.1 is
shown in the left-hand panel of Figure 7.2. We also fit the logistic regression
model

$$\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 C_1(x_i) + \cdots + \beta_K C_K(x_i))}{1 + \exp(\beta_0 + \beta_1 C_1(x_i) + \cdots + \beta_K C_K(x_i))} \tag{7.6}$$

in order to predict the probability that an individual is a high earner on the
basis of age. The right-hand panel of Figure 7.2 displays the fitted posterior
probabilities obtained using this approach.

² We exclude $C_0(X)$ as a predictor in (7.5) because it is redundant with the intercept.
This is similar to the fact that we need only two dummy variables to code a qualitative
variable with three levels, provided that the model will contain an intercept. The decision
to exclude $C_0(X)$ instead of some other $C_k(X)$ in (7.5) is arbitrary. Alternatively, we
could include $C_0(X), C_1(X), \ldots, C_K(X)$, and exclude the intercept.

Unfortunately, unless there are natural breakpoints in the predictors,
piecewise-constant functions can miss the action. For example, in the
left-hand panel of Figure 7.2, the first bin clearly misses the increasing trend
of wage with age. Nevertheless, step function approaches are very popular
in biostatistics and epidemiology, among other disciplines. For example,
5-year age groups are often used to define the bins.
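
A step-function fit can be sketched with cut() and lm(). The data and the cutpoints below are assumptions chosen for illustration. The example also checks the interpretation of the coefficients: the intercept is the mean response in the first bin, and each remaining coefficient is the difference between a bin mean and that baseline.

```r
# Step-function (piecewise-constant) regression on simulated data.
# The cutpoints 35, 50, 65 are arbitrary illustrative choices (K = 3).
set.seed(1)
x <- runif(500, 18, 80)
y <- 50 + 0.5 * x + rnorm(500, sd = 5)

cuts <- c(35, 50, 65)
bins <- cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)  # left-closed intervals
fit  <- lm(y ~ bins)               # the first bin is absorbed by the intercept

bin_means <- tapply(y, bins, mean)
coef(fit)[1]                       # mean of y for x below the first cutpoint
coef(fit)[1] + coef(fit)[-1]       # means of y in the remaining bins

# Each observation falls in exactly one bin, so the indicators sum to one
indicators <- model.matrix(~ bins - 1)
```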


7.3 Basis Functions 

Polynomial and piecewise-constant regression models are in fact special
cases of a basis function approach. The idea is to have at hand a
family of functions or transformations that can be applied to a variable X:
$b_1(X), b_2(X), \ldots, b_K(X)$. Instead of fitting a linear model in X, we fit the
model

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \beta_3 b_3(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i. \tag{7.7}$$

Note that the basis functions $b_1(\cdot), b_2(\cdot), \ldots, b_K(\cdot)$ are fixed and known.
(In other words, we choose the functions ahead of time.) For polynomial
regression, the basis functions are $b_j(x_i) = x_i^j$, and for piecewise constant
functions they are $b_j(x_i) = I(c_j \le x_i < c_{j+1})$. We can think of (7.7) as
a standard linear model with predictors $b_1(x_i), b_2(x_i), \ldots, b_K(x_i)$. Hence,
we can use least squares to estimate the unknown regression coefficients
in (7.7). Importantly, this means that all of the inference tools for linear
models that are discussed in Chapter 3, such as standard errors for the
coefficient estimates and F-statistics for the model's overall significance,
are available in this setting.

Thus far we have considered the use of polynomial functions and
piecewise constant functions for our basis functions; however, many alternatives
are possible. For instance, we can use wavelets or Fourier series to construct
basis functions. In the next section, we investigate a very common choice
for a basis function: regression splines.
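
To make the basis-function idea concrete, the sketch below builds the predictors in (7.7) from a small Fourier basis and fits them by least squares. The four basis functions and the simulated data are assumptions chosen for illustration; any fixed, known functions could be substituted.

```r
# Least squares on a user-chosen set of basis functions (illustrative sketch).
set.seed(1)
x <- runif(200, 0, 1)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# Fixed, known basis functions (a small Fourier basis)
basis <- list(
  b1 = function(x) sin(2 * pi * x),
  b2 = function(x) cos(2 * pi * x),
  b3 = function(x) sin(4 * pi * x),
  b4 = function(x) cos(4 * pi * x)
)
B   <- sapply(basis, function(b) b(x))   # n x K matrix of transformed predictors
fit <- lm(y ~ B)
s   <- summary(fit)
s$coefficients[, "Std. Error"]           # standard errors, as for any linear model
s$fstatistic                             # overall F-statistic
```

Since the model is linear in the transformed predictors, the usual standard errors and F-statistic are reported without any modification.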
7.4 Regression Splines

Now we discuss a flexible class of basis functions that extends upon the
polynomial regression and piecewise constant regression approaches that
we have just seen.

7.4.1 Piecewise Polynomials

Instead of fitting a high-degree polynomial over the entire range of X,
piecewise polynomial regression involves fitting separate low-degree polynomials
over different regions of X. For example, a piecewise cubic polynomial works
by fitting a cubic regression model of the form

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i, \tag{7.8}$$

where the coefficients $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ differ in different parts of the range
of X. The points where the coefficients change are called knots.

For example, a piecewise cubic with no knots is just a standard cubic
polynomial, as in (7.1) with d = 3. A piecewise cubic polynomial with a
single knot at a point c takes the form

$$y_i = \begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c; \\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c.
\end{cases}$$

In other words, we fit two different polynomial functions to the data, one
on the subset of the observations with $x_i < c$, and one on the subset of
the observations with $x_i \ge c$. The first polynomial function has coefficients
$\beta_{01}, \beta_{11}, \beta_{21}, \beta_{31}$, and the second has coefficients $\beta_{02}, \beta_{12}, \beta_{22}, \beta_{32}$. Each of
these polynomial functions can be fit using least squares applied to simple
functions of the original predictor.

Using more knots leads to a more flexible piecewise polynomial. In
general, if we place K different knots throughout the range of X, then we
will end up fitting K + 1 different cubic polynomials. Note that we do not
need to use a cubic polynomial. For example, we can instead fit piecewise
linear functions. In fact, our piecewise constant functions of Section 7.2 are
piecewise polynomials of degree 0!

The top left panel of Figure 7.3 shows a piecewise cubic polynomial fit to
a subset of the Wage data, with a single knot at age=50. We immediately see
a problem: the function is discontinuous and looks ridiculous! Since each
polynomial has four parameters, we are using a total of eight degrees of
freedom in fitting this piecewise polynomial model.
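
An unconstrained piecewise cubic can be sketched by fitting one cubic on each side of the knot. The data here are simulated (an assumption; this is not the Wage subset shown in Figure 7.3), with a single knot placed at 50.

```r
# Unconstrained piecewise cubic with a single knot (simulated data).
set.seed(1)
x <- runif(300, 18, 80)
y <- 80 + 20 * sin(x / 12) + rnorm(300, sd = 8)
knot <- 50

left  <- lm(y ~ poly(x, 3, raw = TRUE), subset = x <  knot)
right <- lm(y ~ poly(x, 3, raw = TRUE), subset = x >= knot)
n_par <- length(coef(left)) + length(coef(right))   # 4 + 4 = 8 parameters

# Nothing forces the two cubics to agree at the knot, so the fit generally jumps there.
at_knot <- data.frame(x = knot)
jump <- predict(right, at_knot) - predict(left, at_knot)
```

The value of jump measures the discontinuity at the knot, which the constraints discussed in the next section are designed to remove.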
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7-4-2 Constraints and Splines 

The top left panel of Figure 7.3 looks wrong because the fitted curve is just 
too flexible. To remedy this problem, we can fit a piecewise polynomial 
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Piecewise Cubic 


Continuous Piecewise Cubic 



Age 


Age 


FIGURE 7.3. Various piecewise polynomials are fit to a subset of the Wage 
data, with a knot at age=50. Top Left: The cubic polynomials are unconstrained. 
Top Right: The cubic polynomials are constrained to be continuous at age=50. 
Bottom Left: The cubic polynomials are constrained to be continuous, and to 
have continuous first and second derivatives. Bottom Right: A linear spline is 
shown, which is constrained to be continuous. 


under the constraint that the fitted curve must be continuous. In other 
words, there cannot be a jump when age=50. The top right plot in Figure 7.3 
shows the resulting fit. This looks better than the top left plot, but the V- 
shaped join looks unnatural. 

In the lower left plot, we have added two additional constraints: now both
the first and second derivatives of the piecewise polynomials are continuous
at age=50. In other words, we are requiring that the piecewise polynomial
be not only continuous when age=50, but also very smooth. Each constraint
that we impose on the piecewise cubic polynomials effectively frees up one
degree of freedom, by reducing the complexity of the resulting piecewise
polynomial fit. So in the top left plot, we are using eight degrees of freedom,
but in the bottom left plot we imposed three constraints (continuity,
continuity of the first derivative, and continuity of the second derivative)
and so are left with five degrees of freedom. The curve in the bottom left
plot is called a cubic spline.³ In general, a cubic spline with K knots uses
a total of 4 + K degrees of freedom.
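
The count of 4 + K can be checked by simple bookkeeping: K knots give K + 1 cubic pieces with four coefficients each, and each knot contributes the three continuity constraints listed above.

```latex
\underbrace{4(K+1)}_{\text{coefficients}} \;-\; \underbrace{3K}_{\text{constraints}} \;=\; K + 4,
\qquad \text{e.g. } K = 1:\quad 8 - 3 = 5 .
```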

In Figure 7.3, the lower right plot is a linear spline, which is continuous
at age=50. The general definition of a degree-d spline is that it is a piecewise
degree-d polynomial, with continuity in derivatives up to degree d − 1 at
each knot. Therefore, a linear spline is obtained by fitting a line in each
region of the predictor space defined by the knots, requiring continuity at 
each knot. 

In Figure 7.3, there is a single knot at age=50. Of course, we could add 
more knots, and impose continuity at each. 


7.4.3 The Spline Basis Representation

The regression splines that we just saw in the previous section may have
seemed somewhat complex: how can we fit a piecewise degree-d polynomial
under the constraint that it (and possibly its first d − 1 derivatives) be
continuous? It turns out that we can use the basis model (7.7) to represent
a regression spline. A cubic spline with K knots can be modeled as

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i, \tag{7.9}$$

for an appropriate choice of basis functions $b_1, b_2, \ldots, b_{K+3}$. The model
(7.9) can then be fit using least squares.

Just as there were several ways to represent polynomials, there are also
many equivalent ways to represent cubic splines using different choices of
basis functions in (7.9). The most direct way to represent a cubic spline
using (7.9) is to start off with a basis for a cubic polynomial, namely
$x, x^2, x^3$, and then add one truncated power basis function per knot.
A truncated power basis function is defined as

$$h(x, \xi) = (x - \xi)^3_+ = \begin{cases} (x - \xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise,} \end{cases} \tag{7.10}$$

where $\xi$ is the knot. One can show that adding a term of the form $\beta_4 h(x, \xi)$
to the model (7.8) for a cubic polynomial will lead to a discontinuity in
only the third derivative at $\xi$; the function will remain continuous, with
continuous first and second derivatives, at each of the knots.

In other words, in order to fit a cubic spline to a data set with K knots, we
perform least squares regression with an intercept and 3 + K predictors, of
the form $X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \ldots, h(X, \xi_K)$, where
$\xi_1, \ldots, \xi_K$ are the knots. This amounts to estimating a total of
K + 4 regression coefficients; for this reason, fitting a cubic spline with
K knots uses K + 4 degrees of freedom.
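
The truncated power basis can be sketched directly in R. The example below uses simulated data and three illustrative knots; the helper `h()` is our own name for (7.10). For comparison it also fits the same model with `bs()` from the `splines` package, which uses a different (B-spline) basis for the same space of cubic splines, so the fitted values should agree up to rounding.

```r
library(splines)  # ships with R; provides bs()

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
knots <- c(0.25, 0.5, 0.75)  # K = 3 illustrative knots

# Truncated power basis function: (x - xi)^3 if x > xi, and 0 otherwise
h <- function(x, xi) pmax(x - xi, 0)^3

# Predictors x, x^2, x^3 plus one truncated power term per knot
H <- sapply(knots, function(xi) h(x, xi))
fit_tp <- lm(y ~ x + I(x^2) + I(x^3) + H)

# Same cubic spline space, represented with the B-spline basis from bs()
fit_bs <- lm(y ~ bs(x, knots = knots))

n_coef <- length(coef(fit_tp))  # intercept + 3 + K coefficients
```

In practice the B-spline basis is usually preferred because it is better conditioned numerically, while the truncated power basis is easier to read.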


3 Cubic splines are popular because most human eyes cannot detect the discontinuity
at the knots.

FIGURE 7.4. A cubic spline and a natural cubic spline, with three knots, fit to
a subset of the Wage data.


Unfortunately, splines can have high variance at the outer range of the
predictors, that is, when X takes on either a very small or very large
value. Figure 7.4 shows a fit to the Wage data with three knots. We see that
the confidence bands in the boundary region appear fairly wild. A natural
spline is a regression spline with additional boundary constraints: the
function is required to be linear at the boundary (in the region where X is
smaller than the smallest knot, or larger than the largest knot). This
additional constraint means that natural splines generally produce more stable
estimates at the boundaries. In Figure 7.4, a natural cubic spline is also
displayed as a red line. Note that the corresponding confidence intervals
are narrower.
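
The effect of the boundary constraints can be seen in a small sketch on simulated data (not the Wage data). Both fits use the same three illustrative knots. The helper `rel_se()` is a hypothetical name; it reports the standard error of the fitted curve at the two ends of the data, divided by the residual scale so that the comparison reflects the basis rather than the noise estimate.

```r
library(splines)

set.seed(2)
x <- sort(runif(300, 18, 80))
y <- 80 + 30 * sin(x / 15) + rnorm(300, sd = 10)
knots <- c(25, 40, 60)  # illustrative interior knots

fit_bs <- lm(y ~ bs(x, knots = knots))  # cubic spline
fit_ns <- lm(y ~ ns(x, knots = knots))  # natural cubic spline

# Standard error of the fit at the smallest and largest x,
# relative to the estimated residual standard deviation
edge <- data.frame(x = range(x))
rel_se <- function(fit) {
  p <- predict(fit, newdata = edge, se.fit = TRUE)
  p$se.fit / p$residual.scale
}
se_bs <- rel_se(fit_bs)
se_ns <- rel_se(fit_ns)
```

Because the natural spline is a constrained version of the cubic spline with the same knots, `se_ns` should not exceed `se_bs` at either end.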


7.4.4 Choosing the Number and Locations of the Knots

When we fit a spline, where should we place the knots? The regression 
spline is most flexible in regions that contain a lot of knots, because in 
those regions the polynomial coefficients can change rapidly. Hence, one 
option is to place more knots in places where we feel the function might 
vary most rapidly, and to place fewer knots where it seems more stable. 
While this option can work well, in practice it is common to place knots in 
a uniform fashion. One way to do this is to specify the desired degrees of 
freedom, and then have the software automatically place the corresponding 
number of knots at uniform quantiles of the data. 

[Figure 7.5. Panel title: Natural Cubic Spline.]

FIGURE 7.5. A natural cubic spline function with four degrees of freedom is
fit to the Wage data. Left: A spline is fit to wage (in thousands of dollars) as
a function of age. Right: Logistic regression is used to model the binary event
wage>250 as a function of age. The fitted posterior probability of wage exceeding
$250,000 is shown.

Figure 7.5 shows an example on the Wage data. As in Figure 7.4, we
have fit a natural cubic spline with three knots, except this time the knot
locations were chosen automatically as the 25th, 50th, and 75th percentiles
of age. This was specified by requesting four degrees of freedom. The
argument by which four degrees of freedom leads to three interior knots is
somewhat technical.⁴
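
The automatic knot placement can be inspected in R. The sketch below uses a simulated stand-in for age (an assumption, not the Wage data) and reads the knots that `ns()` stores as attributes of the basis matrix.

```r
library(splines)

set.seed(3)
age <- round(runif(500, 18, 80))  # simulated stand-in for the age variable

basis <- ns(age, df = 4)                    # natural cubic spline, four degrees of freedom
interior <- attr(basis, "knots")            # interior knots chosen by ns()
boundary <- attr(basis, "Boundary.knots")   # boundary knots (the range of the data)

quartiles <- quantile(age, c(0.25, 0.50, 0.75))
```

Here `interior` should match `quartiles`, and together with the two boundary knots this gives the five knots counted in the footnote.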

How many knots should we use, or equivalently how many degrees of
freedom should our spline contain? One option is to try out different
numbers of knots and see which produces the best looking curve. A somewhat
more objective approach is to use cross-validation, as discussed in
Chapters 5 and 6. With this method, we remove a portion of the data (say 10%),
fit a spline with a certain number of knots to the remaining data, and then
use the spline to make predictions for the held-out portion. We repeat this
process multiple times until each observation has been left out once, and
then compute the overall cross-validated RSS. This procedure can be
repeated for different numbers of knots K. Then the value of K giving the
smallest RSS is chosen.
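
The procedure just described can be sketched as follows, on simulated data. The helper `cv_mse()` and its arguments are hypothetical names. It indexes the fits by degrees of freedom rather than by K, and reports the cross-validated mean squared error, which is the cross-validated RSS divided by n.

```r
library(splines)

set.seed(4)
n <- 300
x <- runif(n, 18, 80)
y <- 80 + 30 * sin(x / 15) + rnorm(n, sd = 10)

# Hypothetical helper: k-fold cross-validated MSE of a natural cubic spline
cv_mse <- function(dof, x, y, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = length(x)))  # random fold labels
  sq_err <- numeric(length(x))
  for (f in seq_len(k)) {
    train <- data.frame(x = x[fold != f], y = y[fold != f])
    fit <- lm(y ~ ns(x, df = dof), data = train)            # fit without fold f
    pred <- predict(fit, newdata = data.frame(x = x[fold == f]))
    sq_err[fold == f] <- (y[fold == f] - pred)^2            # held-out errors
  }
  mean(sq_err)
}

dofs <- 1:8
cv <- sapply(dofs, cv_mse, x = x, y = y)
best_dof <- dofs[which.min(cv)]
```

Each call draws its own random folds, so small differences between neighboring values of `cv` should not be over-interpreted.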


4 There are actually five knots, including the two boundary knots. A cubic spline
with five knots would have nine degrees of freedom. But natural cubic splines have two
additional natural constraints at each boundary to enforce linearity, resulting in 9 − 4 = 5
degrees of freedom. Since this includes a constant, which is absorbed in the intercept,
we count it as four degrees of freedom.

[Figure 7.6. Horizontal axes: Degrees of Freedom of Natural Spline (left); Degrees of Freedom of Cubic Spline (right).]

FIGURE 7.6. Ten-fold cross-validated mean squared errors for selecting the
degrees of freedom when fitting splines to the Wage data. The response is wage
and the predictor age. Left: A natural cubic spline. Right: A cubic spline.


Figure 7.6 shows ten-fold cross-validated mean squared errors for splines 
with various degrees of freedom fit to the Wage data. The left-hand panel 
corresponds to a natural spline and the right-hand panel to a cubic spline. 
The two methods produce almost identical results, with clear evidence that 
a one-degree fit (a linear regression) is not adequate. Both curves flatten 
out quickly, and it seems that three degrees of freedom for the natural 
spline and four degrees of freedom for the cubic spline are quite adequate. 

In Section 7.7 we fit additive spline models simultaneously on several 
variables at a time. This could potentially require the selection of degrees 
of freedom for each variable. In cases like this we typically adopt a more 
pragmatic approach and set the degrees of freedom to a fixed number, say 
four, for all terms. 


7.4.5 Comparison to Polynomial Regression

Regression splines often give superior results to polynomial regression. This
is because unlike polynomials, which must use a high degree (exponent in
the highest monomial term, e.g. $X^{15}$) to produce flexible fits, splines
introduce flexibility by increasing the number of knots but keeping the degree
fixed. Generally, this approach produces more stable estimates. Splines also
allow us to place more knots, and hence flexibility, over regions where the
function $f$ seems to be changing rapidly, and fewer knots where $f$ appears
more stable. Figure 7.7 compares a natural cubic spline with 15 degrees of
freedom to a degree-15 polynomial on the Wage data set. The extra flexibility
in the polynomial produces undesirable results at the boundaries, while
the natural cubic spline still provides a reasonable fit to the data.

FIGURE 7.7. On the Wage data set, a natural cubic spline with 15 degrees
of freedom is compared to a degree-15 polynomial. Polynomials can show wild
behavior, especially near the tails.


7.5 Smoothing Splines 


7.5.1 An Overview of Smoothing Splines 

In the last section we discussed regression splines, which we create by specifying
a set of knots, producing a sequence of basis functions, and then
using least squares to estimate the spline coefficients. We now introduce a 
somewhat different approach that also produces a spline. 

In fitting a smooth curve to a set of data, what we really want to do is
find some function, say $g(x)$, that fits the observed data well: that is, we
want $\mathrm{RSS} = \sum_{i=1}^n (y_i - g(x_i))^2$ to be small. However, there is a problem
with this approach. If we don’t put any constraints on $g(x_i)$, then we can
always make RSS zero simply by choosing $g$ such that it interpolates all
of the $y_i$. Such a function would woefully overfit the data; it would be far
too flexible. What we really want is a function $g$ that makes RSS small,
but that is also smooth.

How might we ensure that $g$ is smooth? There are a number of ways to
do this. A natural approach is to find the function $g$ that minimizes

$$\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt, \tag{7.11}$$

where $\lambda$ is a nonnegative tuning parameter. The function $g$ that minimizes
(7.11) is known as a smoothing spline.

What does (7.11) mean? Equation 7.11 takes the “Loss+Penalty” formulation
that we encounter in the context of ridge regression and the lasso
in Chapter 6. The term $\sum_{i=1}^n (y_i - g(x_i))^2$ is a loss function that
encourages $g$ to fit the data well, and the term $\lambda \int g''(t)^2 dt$ is a penalty term
that penalizes the variability in $g$. The notation $g''(t)$ indicates the second
derivative of the function $g$. The first derivative $g'(t)$ measures the slope
of a function at $t$, and the second derivative corresponds to the amount by
which the slope is changing. Hence, broadly speaking, the second derivative
of a function is a measure of its roughness: it is large in absolute value if
$g(t)$ is very wiggly near $t$, and it is close to zero otherwise. (The second
derivative of a straight line is zero; note that a line is perfectly smooth.)
The $\int$ notation is an integral, which we can think of as a summation over
the range of $t$. In other words, $\int g''(t)^2 dt$ is simply a measure of the total
change in the function $g'(t)$, over its entire range. If $g$ is very smooth, then
$g'(t)$ will be close to constant and $\int g''(t)^2 dt$ will take on a small value.
Conversely, if $g$ is jumpy and variable then $g'(t)$ will vary significantly and
$\int g''(t)^2 dt$ will take on a large value. Therefore, in (7.11), $\lambda \int g''(t)^2 dt$
encourages $g$ to be smooth. The larger the value of $\lambda$, the smoother $g$ will be.

When $\lambda = 0$, then the penalty term in (7.11) has no effect, and so the
function $g$ will be very jumpy and will exactly interpolate the training
observations. When $\lambda \to \infty$, $g$ will be perfectly smooth: it will just be
a straight line that passes as closely as possible to the training points.
In fact, in this case, $g$ will be the linear least squares line, since the loss
function in (7.11) amounts to minimizing the residual sum of squares. For
an intermediate value of $\lambda$, $g$ will approximate the training observations
but will be somewhat smooth. We see that $\lambda$ controls the bias-variance
trade-off of the smoothing spline.

The function $g(x)$ that minimizes (7.11) can be shown to have some special
properties: it is a piecewise cubic polynomial with knots at the unique
values of $x_1, \ldots, x_n$, and continuous first and second derivatives at each
knot. Furthermore, it is linear in the region outside of the extreme knots.
In other words, the function $g(x)$ that minimizes (7.11) is a natural cubic
spline with knots at $x_1, \ldots, x_n$! However, it is not the same natural cubic
spline that one would get if one applied the basis function approach
described in Section 7.4.3 with knots at $x_1, \ldots, x_n$; rather, it is a shrunken
version of such a natural cubic spline, where the value of the tuning
parameter $\lambda$ in (7.11) controls the level of shrinkage.

7.5.2 Choosing the Smoothing Parameter $\lambda$

We have seen that a smoothing spline is simply a natural cubic spline
with knots at every unique value of $x_i$. It might seem that a smoothing
spline will have far too many degrees of freedom, since a knot at each data
point allows a great deal of flexibility. But the tuning parameter $\lambda$ controls
the roughness of the smoothing spline, and hence the effective degrees of
freedom. It is possible to show that as $\lambda$ increases from 0 to $\infty$, the effective
degrees of freedom, which we write $df_\lambda$, decrease from n to 2.
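
The relationship between the amount of smoothing and the effective degrees of freedom can be sketched with R's `smooth.spline()` on simulated data. Note one assumption about the interface: `smooth.spline()` is usually driven by `spar`, a monotone reparameterization of its internal penalty, so a larger `spar` corresponds to a larger penalty. The particular values of `spar` below are arbitrary.

```r
set.seed(5)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# Larger spar means a heavier roughness penalty
spars <- c(0.2, 0.5, 0.8)
eff_df <- sapply(spars, function(s) {
  smooth.spline(x, y, spar = s, all.knots = TRUE)$df  # effective degrees of freedom
})
```

The entries of `eff_df` should decrease as the penalty grows, staying between 2 and n.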

In the context of smoothing splines, why do we discuss effective degrees
of freedom instead of degrees of freedom? Usually degrees of freedom refer
to the number of free parameters, such as the number of coefficients fit in a
polynomial or cubic spline. Although a smoothing spline has n parameters
and hence n nominal degrees of freedom, these n parameters are heavily
constrained or shrunk down. Hence $df_\lambda$ is a measure of the flexibility of the
smoothing spline: the higher it is, the more flexible (and the lower-bias but
higher-variance) the smoothing spline. The definition of effective degrees of
freedom is somewhat technical. We can write

$$\hat{g}_\lambda = S_\lambda y, \tag{7.12}$$

where $\hat{g}_\lambda$ is the solution to (7.11) for a particular choice of $\lambda$; that is, it is an
n-vector containing the fitted values of the smoothing spline at the training
points $x_1, \ldots, x_n$. Equation 7.12 indicates that the vector of fitted values
when applying a smoothing spline to the data can be written as an n × n
matrix $S_\lambda$ (for which there is a formula) times the response vector $y$. Then
the effective degrees of freedom is defined to be

$$df_\lambda = \sum_{i=1}^n \{S_\lambda\}_{ii}, \tag{7.13}$$

the sum of the diagonal elements of the matrix $S_\lambda$.
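
One way to make (7.12) and (7.13) concrete, without the explicit formula for $S_\lambda$, is to use the fact that for a fixed penalty the fit is linear in the response. The sketch below, on simulated data, builds the smoother matrix column by column by smoothing unit vectors with `smooth.spline()`. It assumes that reusing the fitted object's `lambda` component reproduces the same smoother, which is how we read the R documentation.

```r
set.seed(6)
n <- 40
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

fit <- smooth.spline(x, y, df = 6, all.knots = TRUE)

# For fixed lambda the fit is linear in y, so column j of S is the
# smoothing spline fitted to the j-th unit vector with the same lambda
S <- sapply(seq_len(n), function(j) {
  e_j <- as.numeric(seq_len(n) == j)
  predict(smooth.spline(x, e_j, lambda = fit$lambda, all.knots = TRUE), x)$y
})

fitted_via_S <- as.numeric(S %*% y)  # (7.12): fitted values as S times y
eff_df <- sum(diag(S))               # (7.13): trace of S
```

The value of `eff_df` should agree with `fit$df`, the effective degrees of freedom reported by `smooth.spline()` itself.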

In fitting a smoothing spline, we do not need to select the number or
location of the knots: there will be a knot at each training observation,
$x_1, \ldots, x_n$. Instead, we have another problem: we need to choose the value
of $\lambda$. It should come as no surprise that one possible solution to this problem
is cross-validation. In other words, we can find the value of $\lambda$ that makes
the cross-validated RSS as small as possible. It turns out that the
leave-one-out cross-validation error (LOOCV) can be computed very efficiently
for smoothing splines, with essentially the same cost as computing a single
fit, using the following formula:

$$\mathrm{RSS}_{cv}(\lambda) = \sum_{i=1}^n \left(y_i - \hat{g}_\lambda^{(-i)}(x_i)\right)^2 = \sum_{i=1}^n \left[\frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}}\right]^2.$$

The notation $\hat{g}_\lambda^{(-i)}(x_i)$ indicates the fitted value for this smoothing spline
evaluated at $x_i$, where the fit uses all of the training observations except
for the ith observation $(x_i, y_i)$. In contrast, $\hat{g}_\lambda(x_i)$ indicates the smoothing
spline function fit to all of the training observations and evaluated at $x_i$.
This remarkable formula says that we can compute each of these
leave-one-out fits using only $\hat{g}_\lambda$, the original fit to all of the data!⁵ We have
a very similar formula (5.2) on page 180 in Chapter 5 for least squares
linear regression. Using (5.2), we can very quickly perform LOOCV for
the regression splines discussed earlier in this chapter, as well as for least
squares regression using arbitrary basis functions.
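
For a regression spline the shortcut can be checked against brute force in a few lines. This sketch uses simulated data and a natural cubic spline with fixed, illustrative knots, so that every leave-one-out refit uses the same basis functions. The leverages from `hatvalues()` play the role of the diagonal elements in the formula.

```r
library(splines)

set.seed(7)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Fix the knots so that the basis does not change when a point is removed
B <- ns(x, knots = c(0.25, 0.5, 0.75), Boundary.knots = c(0, 1))
fit <- lm(y ~ B)

# Shortcut: residuals of the single full fit, rescaled by 1 - leverage
rss_cv_fast <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit n times, each time predicting the held-out point
X <- cbind(1, B)
rss_cv_slow <- sum(sapply(seq_len(n), function(i) {
  beta <- lm.fit(X[-i, , drop = FALSE], y[-i])$coefficients
  (y[i] - sum(X[i, ] * beta))^2
}))
```

The two quantities agree up to rounding error, but the shortcut needs only one fit instead of n.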


5 The exact formulas for computing $\hat{g}(x_i)$ and $S_\lambda$ are very technical; however, efficient
algorithms are available for computing these quantities.

[Figure 7.8. Panel title: Smoothing Spline.]

FIGURE 7.8. Smoothing spline fits to the Wage data. The red curve results
from specifying 16 effective degrees of freedom. For the blue curve, $\lambda$ was found
automatically by leave-one-out cross-validation, which resulted in 6.8 effective
degrees of freedom.

Figure 7.8 shows the results from fitting a smoothing spline to the Wage 
data. The red curve indicates the fit obtained from pre-specifying that we 
would like a smoothing spline with 16 effective degrees of freedom. The blue 
curve is the smoothing spline obtained when $\lambda$ is chosen using LOOCV; in
this case, the value of $\lambda$ chosen results in 6.8 effective degrees of freedom
(computed using (7.13)). For this data, there is little discernible difference 
between the two smoothing splines, beyond the fact that the one with 16 
degrees of freedom seems slightly wigglier. Since there is little difference 
between the two fits, the smoothing spline fit with 6.8 degrees of freedom 
is preferable, since in general simpler models are better unless the data 
provides evidence in support of a more complex model. 


7.6 Local Regression 

Local regression is a different approach for fitting flexible non-linear
functions, which involves computing the fit at a target point $x_0$ using only the
nearby training observations. Figure 7.9 illustrates the idea on some
simulated data, with one target point near 0.4, and another near the boundary
at 0.05. In this figure the blue line represents the function $f(x)$ from which
the data were generated, and the light orange line corresponds to the local
regression estimate $\hat{f}(x)$. Local regression is described in Algorithm 7.1.

[Figure 7.9. Panel title: Local Regression.]

FIGURE 7.9. Local regression illustrated on some simulated data, where the
blue curve represents $f(x)$ from which the data were generated, and the light
orange curve corresponds to the local regression estimate $\hat{f}(x)$. The orange colored
points are local to the target point $x_0$, represented by the orange vertical line.
The yellow bell-shape superimposed on the plot indicates weights assigned to each
point, decreasing to zero with distance from the target point. The fit $\hat{f}(x_0)$ at $x_0$ is
obtained by fitting a weighted linear regression (orange line segment), and using
the fitted value at $x_0$ (orange solid dot) as the estimate $\hat{f}(x_0)$.

Note that in Step 3 of Algorithm 7.1, the weights $K_{i0}$ will differ for each
value of $x_0$. In other words, in order to obtain the local regression fit at a
new point, we need to fit a new weighted least squares regression model by
minimizing (7.14) for a new set of weights. Local regression is sometimes
referred to as a memory-based procedure, because like nearest-neighbors, we
need all the training data each time we wish to compute a prediction. We
will avoid getting into the technical details of local regression here; there
are books written on the topic.

In order to perform local regression, there are a number of choices to be
made, such as how to define the weighting function $K$, and whether to fit
a linear, constant, or quadratic regression in Step 3 of Algorithm 7.1.
(Equation 7.14 corresponds to a linear regression.) While all of these choices
make some difference, the most important choice is the span $s$, defined in
Step 1 of Algorithm 7.1.
The span plays a role like that of the tuning parameter $\lambda$ in smoothing
splines: it controls the flexibility of the non-linear fit. The smaller the value
of $s$, the more local and wiggly will be our fit; alternatively, a very large
value of $s$ will lead to a global fit to the data using all of the training
observations. We can again use cross-validation to choose $s$, or we can
specify it directly. Figure 7.10 displays local linear regression fits on the
Wage data, using two values of $s$: 0.7 and 0.2. As expected, the fit obtained
using $s = 0.7$ is smoother than that obtained using $s = 0.2$.
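
The effect of the span can be sketched with R's `loess()` function on simulated data. Here `degree = 1` requests local linear fits, and `roughness()` is a hypothetical helper that sums squared second differences of the fitted curve on a grid, as a crude measure of wiggliness.

```r
set.seed(8)
x <- sort(runif(300))
y <- sin(4 * pi * x) + rnorm(300, sd = 0.3)
d <- data.frame(x = x, y = y)

# span is the fraction s of the training points used at each target point
fit_wide   <- loess(y ~ x, data = d, span = 0.7, degree = 1)
fit_narrow <- loess(y ~ x, data = d, span = 0.2, degree = 1)

# Crude roughness measure: squared second differences on an interior grid
grid <- data.frame(x = seq(0.05, 0.95, length.out = 200))
roughness <- function(fit) sum(diff(predict(fit, grid), differences = 2)^2)

rough_wide   <- roughness(fit_wide)
rough_narrow <- roughness(fit_narrow)
```

On data like these the narrow span tracks the oscillation and gives the rougher curve, while the wide span averages over it.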

Algorithm 7.1 Local Regression At $X = x_0$

1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest
to $x_0$.

2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood,
so that the point furthest from $x_0$ has weight zero, and the closest
has the highest weight. All but these k nearest neighbors get weight
zero.

3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the
aforementioned weights, by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize

$$\sum_{i=1}^n K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2. \tag{7.14}$$

4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.


The idea of local regression can be generalized in many different ways.
In a setting with multiple features $X_1, X_2, \ldots, X_p$, one very useful
generalization involves fitting a multiple linear regression model that is global in
some variables, but local in another, such as time. Such varying coefficient
models are a useful way of adapting a model to the most recently gathered
data. Local regression also generalizes very naturally when we want to fit
models that are local in a pair of variables $X_1$ and $X_2$, rather than one.
We can simply use two-dimensional neighborhoods, and fit bivariate linear
regression models using the observations that are near each target point
in two-dimensional space. Theoretically the same approach can be
implemented in higher dimensions, using linear regressions fit to p-dimensional
neighborhoods. However, local regression can perform poorly if p is much
larger than about 3 or 4 because there will generally be very few training
observations close to $x_0$. Nearest-neighbors regression, discussed in
Chapter 3, suffers from a similar problem in high dimensions.
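
For a single predictor, the four steps of Algorithm 7.1 can be sketched in a few lines of R. The function name `local_linear()` is hypothetical, and the tricube weight function is an assumption, since the algorithm leaves the choice of $K$ open. The data are simulated.

```r
# Hypothetical helper implementing Algorithm 7.1 with tricube weights
local_linear <- function(x0, x, y, span = 0.5) {
  k <- ceiling(span * length(x))                   # Step 1: k = s * n nearest points
  d <- abs(x - x0)
  d_k <- sort(d)[k]                                # distance to the k-th nearest point
  w <- ifelse(d <= d_k, (1 - (d / d_k)^3)^3, 0)    # Step 2: furthest neighbor gets weight 0
  fit <- lm(y ~ x, weights = w)                    # Step 3: weighted least squares
  unname(coef(fit)[1] + coef(fit)[2] * x0)         # Step 4: fitted value at x0
}

set.seed(9)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Fits at an interior target point and at one near the boundary
f_hat <- sapply(c(0.05, 0.4), local_linear, x = x, y = y, span = 0.3)
```

Each new target point requires a new weighted regression, which is why the procedure needs all of the training data at prediction time.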


7.7 Generalized Additive Models 

In Sections 7.1-7.6, we present a number of approaches for flexibly
predicting a response Y on the basis of a single predictor X. These approaches can
be seen as extensions of simple linear regression. Here we explore the
problem of flexibly predicting Y on the basis of several predictors, $X_1, \ldots, X_p$.
This amounts to an extension of multiple linear regression.

Generalized additive models (GAMs) provide a general framework for
extending a standard linear model by allowing non-linear functions of each
of the variables, while maintaining additivity. Just like linear models, GAMs
can be applied with both quantitative and qualitative responses. We first
examine GAMs for a quantitative response in Section 7.7.1, and then for a
qualitative response in Section 7.7.2.

FIGURE 7.10. Local linear fits to the Wage data. The span specifies the fraction
of the data used to compute the fit at each target point.


7.7.1 GAMs for Regression Problems

A natural way to extend the multiple linear regression model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

in order to allow for non-linear relationships between each feature and the
response is to replace each linear component $\beta_j x_{ij}$ with a (smooth)
non-linear function $f_j(x_{ij})$. We would then write the model as

$$y_i = \beta_0 + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i. \tag{7.15}$$

This is an example of a GAM. It is called an additive model because we
calculate a separate $f_j$ for each $X_j$, and then add together all of their
contributions.

In Sections 7.1-7.6, we discuss many methods for fitting functions to a
single variable. The beauty of GAMs is that we can use these methods
as building blocks for fitting an additive model. In fact, for most of the
methods that we have seen so far in this chapter, this can be done fairly
trivially. Take, for example, natural splines, and consider the task of fitting
the model

$$\texttt{wage} = \beta_0 + f_1(\texttt{year}) + f_2(\texttt{age}) + f_3(\texttt{education}) + \epsilon \tag{7.16}$$

on the Wage data. Here year and age are quantitative variables, and
education is a qualitative variable with five levels: <HS, HS, <Coll, Coll,
>Coll, referring to the amount of high school or college education that
an individual has completed. We fit the first two functions using natural
splines. We fit the third function using a separate constant for each level,
via the usual dummy variable approach of Section 3.3.1.

FIGURE 7.11. For the Wage data, plots of the relationship between each feature
and the response, wage, in the fitted model (7.16). Each plot displays the fitted
function and pointwise standard errors. The first two functions are natural splines
in year and age, with four and five degrees of freedom, respectively. The third
function is a step function, fit to the qualitative variable education.

Figure 7.11 shows the results of fitting the model (7.16) using least
squares. This is easy to do, since as discussed in Section 7.4, natural splines
can be constructed using an appropriately chosen set of basis functions.
Hence the entire model is just a big regression onto spline basis variables
and dummy variables, all packed into one big regression matrix.
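
The "one big regression" view can be sketched on simulated stand-in data. The variables `x1`, `x2` and `g` below are assumptions that play the roles of year, age and education. The degrees of freedom match those quoted for Figure 7.11.

```r
library(splines)

set.seed(10)
n <- 500
d <- data.frame(
  x1 = runif(n, 2003, 2009),   # stand-in for year
  x2 = runif(n, 18, 80),       # stand-in for age
  g  = factor(sample(c("a", "b", "c", "d", "e"), n, replace = TRUE))  # five levels
)
d$y <- 1.5 * (d$x1 - 2003) - 0.03 * (d$x2 - 45)^2 +
  5 * as.integer(d$g) + rnorm(n, sd = 5)

# A single least squares fit on spline basis columns plus dummy variables
fit <- lm(y ~ ns(x1, 4) + ns(x2, 5) + g, data = d)

# Columns of the regression matrix: 1 intercept + 4 + 5 spline columns + 4 dummies
n_cols <- ncol(model.matrix(fit))
```

Nothing beyond `lm()` is needed here, because every term in the model is a linear combination of fixed basis functions.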

Figure 7.11 can be easily interpreted. The left-hand panel indicates that 
holding age and education fixed, wage tends to increase slightly with year; 
this may be due to inflation. The center panel indicates that holding 
education and year fixed, wage tends to be highest for intermediate values
of age, and lowest for the very young and very old. The right-hand
panel indicates that holding year and age fixed, wage tends to increase 
with education: the more educated a person is, the higher their salary, on 
average. All of these findings are intuitive. 

FIGURE 7.12. Details are as in Figure 7.11, but now $f_1$ and $f_2$ are smoothing
splines with four and five degrees of freedom, respectively.

Figure 7.12 shows a similar triple of plots, but this time $f_1$ and $f_2$ are
smoothing splines with four and five degrees of freedom, respectively.
Fitting a GAM with a smoothing spline is not quite as simple as fitting a GAM
with a natural spline, since in the case of smoothing splines, least squares
cannot be used. However, standard software such as the gam() function in R
can be used to fit GAMs using smoothing splines, via an approach known
as backfitting. This method fits a model involving multiple predictors by
repeatedly updating the fit for each predictor in turn, holding the others
fixed. The beauty of this approach is that each time we update a function,
we simply apply the fitting method for that variable to a partial residual.⁶
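
The idea of backfitting can be sketched in base R. This is only an illustration on simulated data and not the algorithm as implemented in the gam package: the helper `backfit()` is a hypothetical name, it uses `smooth.spline()` for every predictor, and it runs a fixed number of sweeps instead of checking for convergence.

```r
# Hypothetical helper: backfitting with smoothing splines as the
# per-variable fitting method (df per term is an illustrative choice)
backfit <- function(X, y, df = 4, n_iter = 10) {
  beta0 <- mean(y)
  f <- matrix(0, nrow(X), ncol(X))            # current estimates of f_j(x_ij)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Partial residual: remove the intercept and all other fitted functions
      r <- y - beta0 - rowSums(f[, -j, drop = FALSE])
      s <- smooth.spline(X[, j], r, df = df)
      fj <- predict(s, X[, j])$y
      f[, j] <- fj - mean(fj)                 # center each function around zero
    }
  }
  list(beta0 = beta0, f = f, fitted = beta0 + rowSums(f))
}

set.seed(11)
n <- 300
X <- cbind(runif(n), runif(n))
y <- sin(2 * pi * X[, 1]) + 4 * (X[, 2] - 0.5)^2 + rnorm(n, sd = 0.2)
gam_fit <- backfit(X, y)
```

Centering each fitted function keeps the intercept identifiable, since a constant could otherwise be moved freely between the terms.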

The fitted functions in Figures 7.11 and 7.12 look rather similar. In most 
situations, the differences in the GAMs obtained using smoothing splines 
versus natural splines are small. 

We do not have to use splines as the building blocks for GAMs: we can 
just as well use local regression, polynomial regression, or any combination 
of the approaches seen earlier in this chapter in order to create a GAM. 
GAMs are investigated in further detail in the lab at the end of this chapter. 

Pros and Cons of GAMs 

Before we move on, let us summarize the advantages and limitations of a 
GAM. 

▲ GAMs allow us to fit a non-linear $f_j$ to each $X_j$, so that we can
automatically model non-linear relationships that standard linear
regression will miss. This means that we do not need to manually try
out many different transformations on each variable individually.

▲ The non-linear fits can potentially make more accurate predictions
for the response Y.

▲ Because the model is additive, we can still examine the effect of
each $X_j$ on Y individually while holding all of the other variables
fixed. Hence if we are interested in inference, GAMs provide a useful
representation.

▲ The smoothness of the function $f_j$ for the variable $X_j$ can be
summarized via degrees of freedom.

♦ The main limitation of GAMs is that the model is restricted to be
additive. With many variables, important interactions can be missed.
However, as with linear regression, we can manually add interaction
terms to the GAM model by including additional predictors of the
form $X_j \times X_k$. In addition we can add low-dimensional interaction
functions of the form $f_{jk}(X_j, X_k)$ into the model; such terms can
be fit using two-dimensional smoothers such as local regression, or
two-dimensional splines (not covered here).


6 A partial residual for $X_3$, for example, has the form $r_i = y_i - f_1(x_{i1}) - f_2(x_{i2})$.
If we know $f_1$ and $f_2$, then we can fit $f_3$ by treating this residual as a response in a
non-linear regression on $X_3$.

For fully general models, we have to look for even more flexible approaches 
such as random forests and boosting, described in Chapter 8. GAMs provide 
a useful compromise between linear and fully nonparametric models. 


7.7.2 GAMs for Classification Problems

GAMs can also be used in situations where Y is qualitative. For simplicity,
here we will assume Y takes on values zero or one, and let $p(X) = \Pr(Y = 1 \mid X)$
be the conditional probability (given the predictors) that the response
equals one. Recall the logistic regression model (4.6):

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p. \tag{7.17}$$

This logit is the log of the odds of $P(Y = 1 \mid X)$ versus $P(Y = 0 \mid X)$, which
(7.17) represents as a linear function of the predictors. A natural way to
extend (7.17) to allow for non-linear relationships is to use the model

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p). \tag{7.18}$$

Equation 7.18 is a logistic regression GAM. It has all the same pros and
cons as discussed in the previous section for quantitative responses.

We fit a GAM to the Wage data in order to predict the probability that
an individual’s income exceeds $250,000 per year. The GAM that we fit
takes the form

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 \times \texttt{year} + f_2(\texttt{age}) + f_3(\texttt{education}), \tag{7.19}$$

where

$$p(X) = \Pr(\texttt{wage} > 250 \mid \texttt{year}, \texttt{age}, \texttt{education}).$$

FIGURE 7.13. For the Wage data, the logistic regression GAM given in (7.19)
is fit to the binary response I(wage>250). Each plot displays the fitted function
and pointwise standard errors. The first function is linear in year, the second
function a smoothing spline with five degrees of freedom in age, and the third a
step function for education. There are very wide standard errors for the first
level <HS of education.

Once again $f_2$ is fit using a smoothing spline with five degrees of freedom,
and $f_3$ is fit as a step function, by creating dummy variables for each of
the levels of education. The resulting fit is shown in Figure 7.13. The last
panel looks suspicious, with very wide confidence intervals for level <HS. In
fact, there are no ones for that category: no individuals with less than a
high school education make more than $250,000 per year. Hence we refit
the GAM, excluding the individuals with less than a high school education.
The resulting model is shown in Figure 7.14. As in Figures 7.11 and 7.12,
all three panels have the same vertical scale. This allows us to visually
assess the relative contributions of each of the variables. We observe that
age and education have a much larger effect than year on the probability
of being a high earner.


7.8 Lab: Non-linear Modeling 

In this lab, we re-analyze the Wage data considered in the examples throughout
this chapter, in order to illustrate the fact that many of the complex
non-linear fitting procedures discussed can be easily implemented in R. We
begin by loading the ISLR library, which contains the data.

> library(ISLR) 

> attach(Wage) 

FIGURE 7.14. The same model is fit as in Figure 7.13, this time excluding the
observations for which education is <HS. Now we see that increased education
tends to be associated with higher salaries.


7.8.1 Polynomial Regression and Step Functions 

We now examine how Figure 7.1 was produced. We first fit the model using 
the following command: 


> fit=lm(wage~poly(age,4),data=Wage) 

> coef(summary(fit)) 





              Estimate Std. Error t value Pr(>|t|)
(Intercept)    111.704      0.729  153.28   <2e-16
poly(age, 4)1  447.068     39.915   11.20   <2e-16
poly(age, 4)2 -478.316     39.915  -11.98   <2e-16
poly(age, 4)3  125.522     39.915    3.14   0.0017
poly(age, 4)4  -77.911     39.915   -1.95   0.0510



This syntax fits a linear model, using the lm() function, in order to predict
wage using a fourth-degree polynomial in age: poly(age,4). The poly() command
allows us to avoid having to write out a long formula with powers
of age. The function returns a matrix whose columns are a basis of orthogonal
polynomials, which essentially means that each column is a linear
combination of the variables age, age^2, age^3 and age^4.

However, we can also use poly() to obtain age, age^2, age^3 and age^4
directly, if we prefer. We can do this by using the raw=TRUE argument to
the poly() function. Later we see that this does not affect the model in a
meaningful way: though the choice of basis clearly affects the coefficient
estimates, it does not affect the fitted values obtained.
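
The two claims above (orthogonal columns, and identical fitted values under either basis) can be checked on a small example. The sketch below uses simulated data as a hypothetical stand-in for Wage, so that it runs without the ISLR library; the variable names and the data-generating model are assumptions made only for illustration.

```r
# Simulated stand-in for the Wage data (hypothetical values, not the real data set).
set.seed(1)
age  <- sample(18:80, 500, replace = TRUE)
wage <- 60 + 2.5 * age - 0.03 * age^2 + rnorm(500, sd = 10)

# Orthogonal polynomial basis: the columns are orthonormal, so their
# cross-product is (numerically) the 4 x 4 identity matrix.
P.orth  <- poly(age, 4)
orth.ok <- isTRUE(all.equal(crossprod(P.orth), diag(4), check.attributes = FALSE))

# The orthogonal and raw bases span the same space, so the fitted values
# agree up to rounding error even though the coefficients differ.
fit.orth <- lm(wage ~ poly(age, 4))
fit.raw  <- lm(wage ~ poly(age, 4, raw = TRUE))
max.diff <- max(abs(fitted(fit.orth) - fitted(fit.raw)))

orth.ok
max.diff
```
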

> fit2=lm(wage~poly(age,4,raw=T),data=Wage)
> coef(summary(fit2))
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            -1.84e+02   6.00e+01   -3.07 0.002180
poly(age, 4, raw = T)1  2.12e+01   5.89e+00    3.61 0.000312
poly(age, 4, raw = T)2 -5.64e-01   2.06e-01   -2.74 0.006261
poly(age, 4, raw = T)3  6.81e-03   3.07e-03    2.22 0.026398
poly(age, 4, raw = T)4 -3.20e-05   1.64e-05   -1.95 0.051039

There are several other equivalent ways of fitting this model, which showcase
the flexibility of the formula language in R. For example

> fit2a=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage)
> coef(fit2a)
(Intercept)         age    I(age^2)    I(age^3)    I(age^4)
  -1.84e+02    2.12e+01   -5.64e-01    6.81e-03   -3.20e-05

This simply creates the polynomial basis functions on the fly, taking care
to protect terms like age^2 via the wrapper function I() (the ^ symbol has
a special meaning in formulas).

> fit2b=lm(wage~cbind(age,age^2,age^3,age^4),data=Wage)

This does the same more compactly, using the cbind() function for building
a matrix from a collection of vectors; any function call such as cbind() inside
a formula also serves as a wrapper.
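
To see why the I() wrapper matters, it may help to compare a formula with and without it. The following is a minimal sketch on simulated data; x and y are hypothetical names, not part of the Wage data.

```r
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100)

# Without I(), the formula operator ^ means "cross x with itself",
# which collapses to x alone: no quadratic term is fit.
coef.unprotected <- coef(lm(y ~ x + x^2))

# With I(), ^ is ordinary exponentiation and a quadratic term is added.
coef.protected <- coef(lm(y ~ x + I(x^2)))

names(coef.unprotected)  # "(Intercept)" "x"
names(coef.protected)    # "(Intercept)" "x" "I(x^2)"
```
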

We now create a grid of values for age at which we want predictions, and
then call the generic predict() function, specifying that we want standard
errors as well.

> agelims=range(age)
> age.grid=seq(from=agelims[1],to=agelims[2])
> preds=predict(fit,newdata=list(age=age.grid),se=TRUE)
> se.bands=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)

Finally, we plot the data and add the fit from the degree-4 polynomial. 

> par(mfrow=c(1,2),mar=c(4.5,4.5,1,1),oma=c(0,0,4,0))
> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Degree-4 Polynomial",outer=T)
> lines(age.grid,preds$fit,lwd=2,col="blue")
> matlines(age.grid,se.bands,lwd=1,col="blue",lty=3)

Here the mar and oma arguments to par() allow us to control the margins
of the plot, and the title() function creates a figure title that spans both
subplots.

We mentioned earlier that whether or not an orthogonal set of basis functions
is produced in the poly() function will not affect the model obtained
in a meaningful way. What do we mean by this? The fitted values obtained
in either case are identical, up to numerical rounding error:

> preds2=predict(fit2,newdata=list(age=age.grid),se=TRUE)
> max(abs(preds$fit-preds2$fit))
[1] 7.39e-13

In performing a polynomial regression we must decide on the degree of
the polynomial to use. One way to do this is by using hypothesis tests. We
now fit models ranging from linear to a degree-5 polynomial and seek to
determine the simplest model which is sufficient to explain the relationship
between wage and age. We use the anova() function, which performs an
analysis of variance (ANOVA, using an F-test) in order to test the null
hypothesis that a model $\mathcal{M}_1$ is sufficient to explain the data against the
alternative hypothesis that a more complex model $\mathcal{M}_2$ is required. In order
to use the anova() function, $\mathcal{M}_1$ and $\mathcal{M}_2$ must be nested models: the
predictors in $\mathcal{M}_1$ must be a subset of the predictors in $\mathcal{M}_2$. In this case,
we fit five different models and sequentially compare the simpler model to
the more complex model.

> fit.1=lm(wage~age,data=Wage)
> fit.2=lm(wage~poly(age,2),data=Wage)
> fit.3=lm(wage~poly(age,3),data=Wage)
> fit.4=lm(wage~poly(age,4),data=Wage)
> fit.5=lm(wage~poly(age,5),data=Wage)
> anova(fit.1,fit.2,fit.3,fit.4,fit.5)
Analysis of Variance Table

Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1   2998 5022216
2   2997 4793430  1    228786 143.59 <2e-16 ***
3   2996 4777674  1     15756   9.89 0.0017 **
4   2995 4771604  1      6070   3.81 0.0510 .
5   2994 4770322  1      1283   0.80 0.3697
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



The p-value comparing the linear Model 1 to the quadratic Model 2 is
essentially zero ($<10^{-15}$), indicating that a linear fit is not sufficient. Similarly
the p-value comparing the quadratic Model 2 to the cubic Model 3
is very low (0.0017), so the quadratic fit is also insufficient. The p-value
comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately
5% while the degree-5 polynomial Model 5 seems unnecessary
because its p-value is 0.37. Hence, either a cubic or a quartic polynomial
appears to provide a reasonable fit to the data, but lower- or higher-order
models are not justified.

In this case, instead of using the anova() function, we could have obtained
these p-values more succinctly by exploiting the fact that poly() creates
orthogonal polynomials.


> coef(summary(fit.5))
              Estimate Std. Error  t value  Pr(>|t|)
(Intercept)     111.70     0.7288 153.2780 0.000e+00
poly(age, 5)1   447.07    39.9161  11.2002 1.491e-28
poly(age, 5)2  -478.32    39.9161 -11.9830 2.368e-32
poly(age, 5)3   125.52    39.9161   3.1446 1.679e-03
poly(age, 5)4   -77.91    39.9161  -1.9519 5.105e-02
poly(age, 5)5   -35.81    39.9161  -0.8972 3.697e-01



Notice that the p-values are the same, and in fact the squares of the
t-statistics are equal to the F-statistics from the anova() function; for
example:

> (-11.983)^2
[1] 143.6

However, the ANOVA method works whether or not we used orthogonal
polynomials; it also works when we have other terms in the model as well.
For example, we can use anova() to compare these three models:

> fit.1=lm(wage~education+age,data=Wage)
> fit.2=lm(wage~education+poly(age,2),data=Wage)
> fit.3=lm(wage~education+poly(age,3),data=Wage)
> anova(fit.1,fit.2,fit.3)

As an alternative to using hypothesis tests and ANOVA, we could choose 
the polynomial degree using cross-validation, as discussed in Chapter 5. 
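
As one possible way to carry out the cross-validation alternative, the sketch below runs 10-fold cross-validation by hand in base R over polynomial degrees 1 to 5. It uses simulated data with a truly quadratic relationship as a hypothetical stand-in for Wage; the fold count, the seed and the data-generating model are all assumptions chosen for illustration.

```r
set.seed(3)
n    <- 300
age  <- runif(n, 18, 80)
wage <- 50 + 3 * age - 0.03 * age^2 + rnorm(n, sd = 8)  # truly quadratic
dat  <- data.frame(age = age, wage = wage)

k    <- 10
fold <- sample(rep(1:k, length.out = n))  # random fold assignment

# For each degree d, average the held-out mean squared error over the k folds.
cv.mse <- sapply(1:5, function(d) {
  fold.mse <- sapply(1:k, function(j) {
    train <- dat[fold != j, ]
    test  <- dat[fold == j, ]
    fit   <- lm(wage ~ poly(age, d), data = train)
    mean((test$wage - predict(fit, newdata = test))^2)
  })
  mean(fold.mse)
})

cv.mse
best.degree <- which.min(cv.mse)
best.degree
```

Because the folds are random, the selected degree can change with the seed; what should be stable here is that the linear fit has a clearly larger cross-validated error than the quadratic one.
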

Next we consider the task of predicting whether an individual earns more 
than $250,000 per year. We proceed much as before, except that first we 
create the appropriate response vector, and then apply the glm() function 
using family="binomial" in order to fit a polynomial logistic regression 
model. 

> fit=glm(I(wage>250)~poly(age,4),data=Wage,family=binomial)

Note that we again use the wrapper I() to create this binary response 
variable on the fly. The expression wage>250 evaluates to a logical variable 
containing TRUEs and FALSEs, which glm() coerces to binary by setting the 
TRUEs to 1 and the FALSEs to 0. 
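
The coercion described above can be verified directly. This is a small sketch on simulated data (x and high are hypothetical names) that fits the same logistic regression once with a logical response and once with an explicit 0/1 response.

```r
set.seed(4)
x    <- rnorm(200)
high <- (x + rnorm(200)) > 1        # logical vector of TRUEs and FALSEs

fit.logical <- glm(high ~ x, family = binomial)
fit.numeric <- glm(as.numeric(high) ~ x, family = binomial)

# The two fits agree because TRUE is treated as 1 and FALSE as 0.
same.fit <- isTRUE(all.equal(unname(coef(fit.logical)),
                             unname(coef(fit.numeric))))
same.fit
```
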

Once again, we make predictions using the predict() function.

> preds=predict(fit,newdata=list(age = age.grid),se=T) 

However, calculating the confidence intervals is slightly more involved than
in the linear regression case. The default prediction type for a glm() model
is type="link", which is what we use here. This means we get predictions
for the logit: that is, we have fit a model of the form

$$\log\left(\frac{\Pr(Y=1|X)}{1-\Pr(Y=1|X)}\right) = X\beta,$$

and the predictions given are of the form $X\hat\beta$. The standard errors given are
also of this form. In order to obtain confidence intervals for Pr(Y = 1|X),
we use the transformation

$$\Pr(Y=1|X) = \frac{\exp(X\beta)}{1+\exp(X\beta)}.$$

> pfit=exp(preds$fit)/(1+exp(preds$fit))
> se.bands.logit=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
> se.bands=exp(se.bands.logit)/(1+exp(se.bands.logit))

Note that we could have directly computed the probabilities by selecting
the type="response" option in the predict() function.

> preds=predict(fit,newdata=list(age=age.grid),type="response",se=T)

However, the corresponding confidence intervals would not have been sensible
because we would end up with negative probabilities!
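
The difference between the two ways of building the bands can be illustrated as follows. This is a sketch on simulated data with a rare binary outcome (a hypothetical stand-in for the high-earner indicator); it uses the base R function plogis(), which computes exp(x)/(1+exp(x)).

```r
set.seed(5)
n    <- 2000
age  <- runif(n, 18, 80)
# Rare event, loosely mimicking a small share of high earners (hypothetical model).
high <- rbinom(n, 1, plogis(-4 + 0.02 * (age - 18)))
fit  <- glm(high ~ poly(age, 2), family = binomial)
grid <- data.frame(age = seq(18, 80, by = 1))

# Bands built on the logit scale, then mapped to probabilities with plogis().
p.link      <- predict(fit, newdata = grid, se.fit = TRUE)  # type = "link" is the default
lower.logit <- plogis(p.link$fit - 2 * p.link$se.fit)
upper.logit <- plogis(p.link$fit + 2 * p.link$se.fit)

# Bands built directly on the probability scale are not constrained to [0, 1].
p.resp     <- predict(fit, newdata = grid, type = "response", se.fit = TRUE)
lower.resp <- p.resp$fit - 2 * p.resp$se.fit

range(lower.logit, upper.logit)  # always inside [0, 1]
min(lower.resp)                  # may be negative when fitted probabilities are near 0
```
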

Finally, the right-hand plot from Figure 7.1 was made as follows: 

> plot(age,I(wage>250),xlim=agelims,type="n",ylim=c(0,.2))
> points(jitter(age),I((wage>250)/5),cex=.5,pch="|",col="darkgrey")
> lines(age.grid,pfit,lwd=2,col="blue")
> matlines(age.grid,se.bands,lwd=1,col="blue",lty=3)

We have drawn the age values corresponding to the observations with wage
values above 250 as gray marks on the top of the plot, and those with wage
values below 250 are shown as gray marks on the bottom of the plot. We
used the jitter() function to jitter the age values a bit so that observations
with the same age value do not cover each other up. This is often called a
rug plot.

In order to fit a step function, as discussed in Section 7.2, we use the 
cut() function. 

> table(cut(age,4))

(17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1]
        750        1399         779          72
> fit=lm(wage~cut(age,4),data=Wage)
> coef(summary(fit))
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)               94.16       1.48   63.79 0.00e+00
cut(age, 4)(33.5,49]      24.05       1.83   13.15 1.98e-38
cut(age, 4)(49,64.5]      23.66       2.07   11.44 1.04e-29
cut(age, 4)(64.5,80.1]     7.64       4.99    1.53 1.26e-01


Here cut() automatically picked the cutpoints at 33.5, 49, and 64.5 years
of age. We could also have specified our own cutpoints directly using the
breaks option. The function cut() returns an ordered categorical variable;
the lm() function then creates a set of dummy variables for use in the regression.
The age<33.5 category is left out, so the intercept coefficient of
$94,160 can be interpreted as the average salary for those under 33.5 years
of age, and the other coefficients can be interpreted as the average additional
salary for those in the other age groups. We can produce predictions
and plots just as we did in the case of the polynomial fit.
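
The interpretation of the step-function coefficients, and the use of the breaks option, can be checked with a short sketch. The data are simulated and the cutpoints (17, 35, 50, 65, 80) are hypothetical values chosen only for illustration.

```r
set.seed(6)
age  <- sample(18:80, 400, replace = TRUE)
wage <- 80 + 20 * (age > 35) - 10 * (age > 65) + rnorm(400, sd = 5)

# User-chosen cutpoints instead of letting cut() pick them.
age.bin <- cut(age, breaks = c(17, 35, 50, 65, 80))
fit     <- lm(wage ~ age.bin)

group.means <- tapply(wage, age.bin, mean)

# Intercept = mean of the baseline bin; each other coefficient = that bin's
# mean minus the baseline mean. Both differences below are numerically zero.
coef(fit)[1] - group.means[1]
coef(fit)[-1] - (group.means[-1] - group.means[1])
```
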


7.8.2 Splines

In order to fit regression splines in R, we use the splines library. In Section 
7.4, we saw that regression splines can be fit by constructing an appropriate 
matrix of basis functions. The bs() function generates the entire matrix of 
basis functions for splines with the specified set of knots. By default, cubic 
splines are produced. Fitting wage to age using a regression spline is simple: 

> library(splines) 

> fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage) 

> pred=predict(fit,newdata=list(age=age.grid),se=T) 

> plot(age,wage,col="gray")
> lines(age.grid,pred$fit,lwd=2)
> lines(age.grid,pred$fit+2*pred$se,lty="dashed")
> lines(age.grid,pred$fit-2*pred$se,lty="dashed")

Here we have prespecified knots at ages 25, 40, and 60. This produces a 
spline with six basis functions. (Recall that a cubic spline with three knots 
has seven degrees of freedom; these degrees of freedom are used up by an 
intercept, plus six basis functions.) We could also use the df option to 
produce a spline with knots at uniform quantiles of the data. 

> dim(bs(age,knots=c(25,40,60)))
[1] 3000    6
> dim(bs(age,df=6))
[1] 3000    6
> attr(bs(age,df=6),"knots")
 25%  50%  75%
33.8 42.0 51.0

In this case R chooses knots at ages 33.8, 42.0, and 51.0, which correspond
to the 25th, 50th, and 75th percentiles of age. The function bs() also has 
a degree argument, so we can fit splines of any degree, rather than the 
default degree of 3 (which yields a cubic spline). 
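
The relationship between knots, degree and the number of basis functions can be explored without the Wage data. The sketch below uses simulated ages (a hypothetical stand-in); the splines package ships with R.

```r
library(splines)  # ships with R as a recommended package
set.seed(7)
age <- runif(1000, 18, 80)

B.knots <- bs(age, knots = c(25, 40, 60))  # cubic spline, 3 knots -> 6 columns
B.df    <- bs(age, df = 6)                 # same size, knots placed at quantiles
B.quad  <- bs(age, df = 6, degree = 2)     # quadratic spline: 6 - 2 = 4 interior knots
N.df    <- ns(age, df = 4)                 # natural spline with 4 basis functions

dim(B.knots)
attr(B.df, "knots")                        # compare with quantile(age, c(.25, .5, .75))
length(attr(B.quad, "knots"))
ncol(N.df)
```
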

In order to instead fit a natural spline, we use the ns() function. Here 
we fit a natural spline with four degrees of freedom. 

> fit2=lm(wage~ns(age,df=4),data=Wage)

> pred2=predict(fit2,newdata=list(age=age.grid),se=T) 

> lines(age.grid, pred2$fit,col="red",lwd=2) 

As with the bs() function, we could instead specify the knots directly using 
the knots option. 

In order to fit a smoothing spline, we use the smooth.spline() function. 
Figure 7.8 was produced with the following code: 

> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Smoothing Spline")
> fit=smooth.spline(age,wage,df=16)
> fit2=smooth.spline(age,wage,cv=TRUE)
> fit2$df
[1] 6.8
> lines(fit,col="red",lwd=2)
> lines(fit2,col="blue",lwd=2)
> legend("topright",legend=c("16 DF","6.8 DF"),
      col=c("red","blue"),lty=1,lwd=2,cex=.8)


Notice that in the first call to smooth.spline(), we specified df=16. The
function then determines which value of λ leads to 16 degrees of freedom. In
the second call to smooth.spline(), we select the smoothness level by cross-validation;
this results in a value of λ that yields 6.8 degrees of freedom.
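
The link between the requested degrees of freedom and the fitted value of λ can be inspected on the returned object. This is a sketch on simulated data (x and y are hypothetical), using the df and lambda components that smooth.spline() returns.

```r
set.seed(8)
x <- runif(500, 0, 10)
y <- sin(x) + rnorm(500, sd = 0.3)

fit.df <- smooth.spline(x, y, df = 16)    # lambda chosen so that df is about 16
fit.cv <- smooth.spline(x, y, cv = TRUE)  # lambda chosen by leave-one-out cross-validation

c(df = fit.df$df, lambda = fit.df$lambda)
c(df = fit.cv$df, lambda = fit.cv$lambda)
```

The simulated x values are all distinct; with tied x values, as in the age variable, smooth.spline() may warn that cross-validation with non-unique x values is doubtful.
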

In order to perform local regression, we use the loess() function.

> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Local Regression")
> fit=loess(wage~age,span=.2,data=Wage)
> fit2=loess(wage~age,span=.5,data=Wage)
> lines(age.grid,predict(fit,data.frame(age=age.grid)),
      col="red",lwd=2)
> lines(age.grid,predict(fit2,data.frame(age=age.grid)),
      col="blue",lwd=2)
> legend("topright",legend=c("Span=0.2","Span=0.5"),
      col=c("red","blue"),lty=1,lwd=2,cex=.8)

Here we have performed local regression using spans of 0.2 and 0.5:
that is, each neighborhood consists of 20% or 50% of the observations. The
larger the span, the smoother the fit. (By default loess() fits local quadratics,
degree=2; setting degree=1 gives a local linear fit.) The locfit library can also
be used for fitting local regression models in R.
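
One way to quantify "the larger the span, the smoother the fit" is the equivalent number of parameters that loess() stores in the enp component. The sketch below uses simulated data (x and y are hypothetical) and also shows a local linear fit via degree = 1, since loess() uses degree = 2 unless told otherwise.

```r
set.seed(9)
x <- runif(400, 0, 10)
y <- sin(x) + rnorm(400, sd = 0.3)

fit.small <- loess(y ~ x, span = 0.2)
fit.large <- loess(y ~ x, span = 0.5)
fit.lin   <- loess(y ~ x, span = 0.5, degree = 1)  # local linear fit

# enp is the equivalent number of parameters; fewer parameters means a smoother fit.
c(span0.2 = fit.small$enp, span0.5 = fit.large$enp, span0.5.linear = fit.lin$enp)
```
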


7.8.3 GAMs 

We now fit a GAM to predict wage using natural spline functions of year 
and age, treating education as a qualitative predictor, as in (7.16). Since 
this is just a big linear regression model using an appropriate choice of 
basis functions, we can simply do this using the lm() function. 

> gam1=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)
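
To make the "big linear regression" point concrete, one can look at the design matrix that lm() builds from the spline terms. The sketch below uses simulated data with a three-level education factor; all names, levels and values are hypothetical stand-ins for the Wage variables.

```r
library(splines)
set.seed(10)
n         <- 500
year      <- sample(2003:2009, n, replace = TRUE)
age       <- runif(n, 18, 80)
education <- factor(sample(c("HS", "College", "Advanced"), n, replace = TRUE))
wage      <- 0.5 * (year - 2003) + 40 * sin(age / 25) +
  10 * (education == "Advanced") + rnorm(n, sd = 5)

fit <- lm(wage ~ ns(year, 4) + ns(age, 5) + education)

# Each ns() term contributes its basis functions as ordinary columns:
# 1 intercept + 4 (year) + 5 (age) + 2 dummy variables (education) = 12.
X <- model.matrix(fit)
dim(X)
```
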

We now fit the model (7.16) using smoothing splines rather than natural 
splines. In order to fit more general sorts of GAMs, using smoothing splines 
or other components that cannot be expressed in terms of basis functions 
and then fit using least squares regression, we will need to use the gam 
library in R. 

The s() function, which is part of the gam library, is used to indicate that 
we would like to use a smoothing spline. We specify that the function of 
year should have 4 degrees of freedom, and that the function of age will 
have 5 degrees of freedom. Since education is qualitative, we leave it as is, 
and it is converted into four dummy variables. We use the gam() function in 
order to fit a GAM using these components. All of the terms in (7.16) are 
fit simultaneously, taking each other into account to explain the response. 

> library(gam) 

> gam.m3=gam(wage~s(year,4)+s(age,5)+education,data=Wage)

In order to produce Figure 7.12, we simply call the plot() function:

> par(mfrow=c(1,3)) 

> plot(gam.m3,se=TRUE,col="blue")

The generic plot() function recognizes that gam.m3 is an object of class gam,
and invokes the appropriate plot.gam() method. Conveniently, even though
gam1 is not of class gam but rather of class lm, we can still use plot.gam()
on it. Figure 7.11 was produced using the following expression:

> plot.gam(gam1,se=TRUE,col="red")

Notice here we had to use plot.gam() rather than the generic plot()
function.

In these plots, the function of year looks rather linear. We can perform a
series of ANOVA tests in order to determine which of these three models is
best: a GAM that excludes year ($\mathcal{M}_1$), a GAM that uses a linear function
of year ($\mathcal{M}_2$), or a GAM that uses a spline function of year ($\mathcal{M}_3$).

> gam.m1=gam(wage~s(age,5)+education,data=Wage)
> gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
> anova(gam.m1,gam.m2,gam.m3,test="F")
Analysis of Deviance Table

Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
  Resid. Df Resid. Dev Df Deviance    F  Pr(>F)
1      2990    3711730
2      2989    3693841  1    17889 14.5 0.00014 ***
3      2986    3689770  3     4071  1.1 0.34857
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We find that there is compelling evidence that a GAM with a linear function
of year is better than a GAM that does not include year at all
(p-value = 0.00014). However, there is no evidence that a non-linear function
of year is needed (p-value = 0.349). In other words, based on the results
of this ANOVA, $\mathcal{M}_2$ is preferred.

The summary() function produces a summary of the gam fit.

> summary(gam.m3)

Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education,
    data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-119.43  -19.70   -3.33   14.17  213.48

(Dispersion Parameter for gaussian family taken to be 1236)

    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29888

Number of Local Scoring Iterations: 2

DF for Terms and F-values for Nonparametric Effects

            Df Npar Df Npar F  Pr(F)
(Intercept)  1
s(year, 4)   1       3    1.1   0.35
s(age, 5)    1       4   32.4 <2e-16 ***
education    4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for year and age correspond to a null hypothesis of a linear
relationship versus the alternative of a non-linear relationship. The large
p-value for year reinforces our conclusion from the ANOVA test that a linear
function is adequate for this term. However, there is very clear evidence
that a non-linear term is required for age.

We can make predictions from gam objects, just like from lm objects,
using the predict() method for the class gam. Here we make predictions on
the training set.

> preds=predict(gam.m2,newdata=Wage) 

We can also use local regression fits as building blocks in a GAM, using 
the lo() function. 

> gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,data=Wage)
> plot.gam(gam.lo,se=TRUE,col="green")

Here we have used local regression for the age term, with a span of 0.7. 

We can also use the lo() function to create interactions before calling the 
gam() function. For example, 

> gam.lo.i=gam(wage~lo(year,age,span=0.5)+education, 

data=Wage) 

fits a two-term model, in which the first term is an interaction between 
year and age, fit by a local regression surface. We can plot the resulting 
two-dimensional surface if we first install the akima package. 

> library(akima) 

> plot(gam.lo.i)

In order to fit a logistic regression GAM, we once again use the I() function
in constructing the binary response variable, and set family=binomial.

> gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,
      family=binomial,data=Wage)

> par(mfrow=c(1,3)) 

> plot(gam.lr,se=T,col="green") 

It is easy to see that there are no high earners in the <HS category:

> table(education,I(wage>250))

education            FALSE TRUE
  1. < HS Grad         268    0
  2. HS Grad           966    5
  3. Some College      643    7
  4. College Grad      663   22
  5. Advanced Degree   381   45


Hence, we fit a logistic regression GAM using all but this category. This 
provides more sensible results. 

> gam.lr.s=gam(I(wage>250)~year+s(age,df=5)+education,family=binomial,
      data=Wage,subset=(education!="1. < HS Grad"))

> plot(gam.lr.s,se=T,col="green") 
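
The effect of a category with no positive responses can be reproduced with an ordinary logistic regression. The sketch below uses glm() on simulated data rather than gam(), and a hypothetical three-level factor in which level "C" never produces a positive response; it is meant only to illustrate why the standard errors blow up and why dropping the level helps.

```r
set.seed(11)
n     <- 600
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
# Hypothetical setup: level "C" never produces a positive response.
p <- c(A = 0.10, B = 0.25, C = 0)[as.character(group)]
y <- rbinom(n, 1, p)

# glm() warns about fitted probabilities of 0 here; the warning is suppressed
# because that behavior is exactly what is being illustrated.
fit.all <- suppressWarnings(glm(y ~ group, family = binomial))
se.all  <- coef(summary(fit.all))[, "Std. Error"]

# Refit after dropping the problematic level, as in the text.
fit.sub <- glm(y ~ group, family = binomial, subset = (group != "C"))
se.sub  <- coef(summary(fit.sub))[, "Std. Error"]

se.all  # the entry for groupC is enormous
se.sub  # all entries are of ordinary size
```
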


7.9 Exercises 

Conceptual 

1. It was mentioned in the chapter that a cubic regression spline with
one knot at $\xi$ can be obtained using a basis of the form $x$, $x^2$, $x^3$,
$(x-\xi)^3_+$, where $(x-\xi)^3_+ = (x-\xi)^3$ if $x > \xi$ and equals 0 otherwise.
We will now show that a function of the form

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-\xi)^3_+$$

is indeed a cubic regression spline, regardless of the values of
$\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$.

(a) Find a cubic polynomial

$$f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3$$

such that $f(x) = f_1(x)$ for all $x \le \xi$. Express $a_1, b_1, c_1, d_1$ in
terms of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$.

(b) Find a cubic polynomial

$$f_2(x) = a_2 + b_2 x + c_2 x^2 + d_2 x^3$$

such that $f(x) = f_2(x)$ for all $x > \xi$. Express $a_2, b_2, c_2, d_2$ in
terms of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$. We have now established that $f(x)$ is
a piecewise polynomial.

(c) Show that $f_1(\xi) = f_2(\xi)$. That is, $f(x)$ is continuous at $\xi$.

(d) Show that $f_1'(\xi) = f_2'(\xi)$. That is, $f'(x)$ is continuous at $\xi$.

(e) Show that $f_1''(\xi) = f_2''(\xi)$. That is, $f''(x)$ is continuous at $\xi$.
Therefore, $f(x)$ is indeed a cubic spline.

Hint: Parts (d) and (e) of this problem require knowledge of single-variable
calculus. As a reminder, given a cubic polynomial

$$f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3,$$

the first derivative takes the form

$$f_1'(x) = b_1 + 2 c_1 x + 3 d_1 x^2$$

and the second derivative takes the form

$$f_1''(x) = 2 c_1 + 6 d_1 x.$$


2. Suppose that a curve $\hat{g}$ is computed to smoothly fit a set of n points
using the following formula:

$$\hat{g} = \arg\min_g \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int \left[ g^{(m)}(x) \right]^2 dx \right),$$

where $g^{(m)}$ represents the mth derivative of g (and $g^{(0)} = g$). Provide
example sketches of $\hat{g}$ in each of the following scenarios.

(a) $\lambda = \infty$, $m = 0$.

(b) $\lambda = \infty$, $m = 1$.

(c) $\lambda = \infty$, $m = 2$.

(d) $\lambda = \infty$, $m = 3$.

(e) $\lambda = 0$, $m = 3$.


3. Suppose we fit a curve with basis functions $b_1(X) = X$, $b_2(X) =
(X-1)^2 I(X \ge 1)$. (Note that $I(X \ge 1)$ equals 1 for $X \ge 1$ and 0
otherwise.) We fit the linear regression model

$$Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon,$$

and obtain coefficient estimates $\hat\beta_0 = 1$, $\hat\beta_1 = 1$, $\hat\beta_2 = -2$. Sketch the
estimated curve between X = −2 and X = 2. Note the intercepts,
slopes, and other relevant information.


4. Suppose we fit a curve with basis functions $b_1(X) = I(0 \le X \le 2) -
(X-1) I(1 \le X \le 2)$, $b_2(X) = (X-3) I(3 \le X \le 4) + I(4 < X \le 5)$.
We fit the linear regression model

$$Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon,$$

and obtain coefficient estimates $\hat\beta_0 = 1$, $\hat\beta_1 = 1$, $\hat\beta_2 = 3$. Sketch the
estimated curve between X = −2 and X = 2. Note the intercepts,
slopes, and other relevant information.

5. Consider two curves, $\hat{g}_1$ and $\hat{g}_2$, defined by

$$\hat{g}_1 = \arg\min_g \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int \left[ g^{(3)}(x) \right]^2 dx \right),$$

$$\hat{g}_2 = \arg\min_g \left( \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int \left[ g^{(4)}(x) \right]^2 dx \right),$$

where $g^{(m)}$ represents the mth derivative of g.

(a) As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training RSS?

(b) As $\lambda \to \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller test RSS?

(c) For $\lambda = 0$, will $\hat{g}_1$ or $\hat{g}_2$ have the smaller training and test RSS?


Applied 

6. In this exercise, you will further analyze the Wage data set considered
throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use
cross-validation to select the optimal degree d for the polynomial.
What degree was chosen, and how does this compare to
the results of hypothesis testing using ANOVA? Make a plot of
the resulting polynomial fit to the data.

(b) Fit a step function to predict wage using age, and perform cross-validation
to choose the optimal number of cuts. Make a plot of
the fit obtained.

7. The Wage data set contains a number of other features not explored 
in this chapter, such as marital status (maritl), job class (jobclass), 
and others. Explore the relationships between some of these other 
predictors and wage, and use non-linear fitting techniques in order to 
fit flexible models to the data. Create plots of the results obtained, 
and write a summary of your findings. 

8. Fit some of the non-linear models investigated in this chapter to the
Auto data set. Is there evidence for non-linear relationships in this 
data set? Create some informative plots to justify your answer. 

9. This question uses the variables dis (the weighted mean of distances
to five Boston employment centers) and nox (nitrogen oxides concentration
in parts per 10 million) from the Boston data. We will treat
dis as the predictor and nox as the response.

(a) Use the poly() function to fit a cubic polynomial regression to
predict nox using dis. Report the regression output, and plot
the resulting data and polynomial fits.

(b) Plot the polynomial fits for a range of different polynomial
degrees (say, from 1 to 10), and report the associated residual
sum of squares.

(c) Perform cross-validation or another approach to select the optimal
degree for the polynomial, and explain your results.

(d) Use the bs() function to fit a regression spline to predict nox 
using dis. Report the output for the fit using four degrees of 
freedom. How did you choose the knots? Plot the resulting fit. 

(e) Now fit a regression spline for a range of degrees of freedom, and 
plot the resulting fits and report the resulting RSS. Describe the 
results obtained. 

(f) Perform cross-validation or another approach in order to select 
the best degrees of freedom for a regression spline on this data. 
Describe your results. 

10. This question relates to the College data set. 

(a) Split the data into a training set and a test set. Using out-of-state 
tuition as the response and the other variables as the predictors, 
perform forward stepwise selection on the training set in order 
to identify a satisfactory model that uses just a subset of the 
predictors. 

(b) Fit a GAM on the training data, using out-of-state tuition as 
the response and the features selected in the previous step as 
the predictors. Plot the results, and explain your findings. 

(c) Evaluate the model obtained on the test set, and explain the 
results obtained. 

(d) For which variables, if any, is there evidence of a non-linear 
relationship with the response? 

11. In Section 7.7, it was mentioned that GAMs are generally fit using 
a backfitting approach. The idea behind backfitting is actually quite 
simple. We will now explore backfitting in the context of multiple 
linear regression. 

Suppose that we would like to perform multiple linear regression, but
we do not have software to do so. Instead, we only have software
to perform simple linear regression. Therefore, we take the following
iterative approach: we repeatedly hold all but one coefficient estimate
fixed at its current value, and update only that coefficient
estimate using a simple linear regression. The process is continued until
convergence, that is, until the coefficient estimates stop changing.

We now try this out on a toy example. 

(a) Generate a response $Y$ and two predictors $X_1$ and $X_2$, with
n = 100.

(b) Initialize $\hat\beta_1$ to take on a value of your choice. It does not matter
what value you choose.

(c) Keeping $\hat\beta_1$ fixed, fit the model

$$Y - \hat\beta_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon.$$

You can do this as follows:

> a=y-beta1*x1
> beta2=lm(a~x2)$coef[2]

(d) Keeping $\hat\beta_2$ fixed, fit the model

$$Y - \hat\beta_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon.$$

You can do this as follows:

> a=y-beta2*x2
> beta1=lm(a~x1)$coef[2]

(e) Write a for loop to repeat (c) and (d) 1,000 times. Report the
estimates of $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ at each iteration of the for loop.
Create a plot in which each of these values is displayed, with $\hat\beta_0$,
$\hat\beta_1$, and $\hat\beta_2$ each shown in a different color.

(f) Compare your answer in (e) to the results of simply performing
multiple linear regression to predict $Y$ using $X_1$ and $X_2$. Use
the abline() function to overlay those multiple linear regression
coefficient estimates on the plot obtained in (e).

(g) On this data set, how many backfitting iterations were required
in order to obtain a “good” approximation to the multiple regression
coefficient estimates?

12. This problem is a continuation of the previous exercise. In a toy 
example with p = 100, show that one can approximate the multiple 
linear regression coefficient estimates by repeatedly performing simple 
linear regression in a backfitting procedure. How many backfitting 
iterations are required in order to obtain a “good” approximation to 
the multiple regression coefficient estimates? Create a plot to justify 
your answer. 


8 

Tree-Based Methods 


In this chapter, we describe tree-based methods for regression and
classification. These involve stratifying or segmenting the predictor space
into a number of simple regions. In order to make a prediction for a given
observation, we typically use the mean or the mode of the training observations
in the region to which it belongs. Since the set of splitting rules used
to segment the predictor space can be summarized in a tree, these types of
approaches are known as decision tree methods.
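
The prediction rule just described (the mean response of the region containing the observation) can be sketched in a few lines of base R. The data below are simulated, the variable names are hypothetical, and the two cutpoints are fixed by hand rather than learned by a tree-growing algorithm, so this illustrates only the prediction step.

```r
set.seed(12)
n <- 200
# Hypothetical stand-in for baseball-player data.
years      <- sample(1:20, n, replace = TRUE)
hits       <- sample(30:220, n, replace = TRUE)
log.salary <- 5 + 0.8 * (years >= 5) + 0.6 * (years >= 5 & hits >= 118) +
  rnorm(n, sd = 0.4)

# Three regions defined by two splitting rules (cutpoints chosen by hand).
region <- function(years, hits) {
  ifelse(years < 4.5, "R1", ifelse(hits < 117.5, "R2", "R3"))
}
leaf.mean <- tapply(log.salary, region(years, hits), mean)

# Prediction for a new observation: the mean training response of its region.
predict.region <- function(years, hits) {
  as.numeric(leaf.mean[region(years, hits)])
}

leaf.mean
predict.region(years = c(2, 10, 10), hits = c(150, 90, 150))
```

For a classification problem the same idea applies with the most common class in each region replacing the mean.
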

Tree-based methods are simple and useful for interpretation. However,
they typically are not competitive with the best supervised learning approaches,
such as those seen in Chapters 6 and 7, in terms of prediction
accuracy. Hence in this chapter we also introduce bagging, random forests,
and boosting. Each of these approaches involves producing multiple trees
which are then combined to yield a single consensus prediction. We will
see that combining a large number of trees can often result in dramatic
improvements in prediction accuracy, at the expense of some loss in interpretation.


8.1 The Basics of Decision Trees 

Decision trees can be applied to both regression and classification problems. 
We first consider regression problems, and then move on to classification. 


G. James et al., An Introduction to Statistical Learning: with Applications in R,
Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_8


FIGURE 8.1. For the Hitters data, a regression tree for predicting the log
salary of a baseball player, based on the number of years that he has played in
the major leagues and the number of hits that he made in the previous year. At a
given internal node, the label (of the form \(X_j < t_k\)) indicates the left-hand
branch emanating from that split, and the right-hand branch corresponds to
\(X_j \ge t_k\). For instance, the split at the top of the tree results in two large
branches. The left-hand branch corresponds to Years<4.5, and the right-hand
branch corresponds to Years>=4.5. The tree has two internal nodes and three
terminal nodes, or leaves. The number in each leaf is the mean of the response
for the observations that fall there.


8.1.1 Regression Trees 

In order to motivate regression trees, we begin with a simple example. 


Predicting Baseball Players’ Salaries Using Regression Trees 

We use the Hitters data set to predict a baseball player’s Salary based on 
Years (the number of years that he has played in the major leagues) and 
Hits (the number of hits that he made in the previous year). We first remove 
observations that are missing Salary values, and log-transform Salary so 
that its distribution has more of a typical bell-shape. (Recall that Salary 
is measured in thousands of dollars.) 

Figure 8.1 shows a regression tree fit to this data. It consists of a series
of splitting rules, starting at the top of the tree. The top split assigns
observations having Years<4.5 to the left branch.¹ The predicted salary
for these players is given by the mean response value for the players in
the data set with Years<4.5. For such players, the mean log salary is 5.107,
and so we make a prediction of \(e^{5.107}\) thousands of dollars, i.e. $165,174, for
these players. Players with Years>=4.5 are assigned to the right branch, and
then that group is further subdivided by Hits. Overall, the tree stratifies
or segments the players into three regions of predictor space: players who
have played for four or fewer years, players who have played for five or more
years and who made fewer than 118 hits last year, and players who have
played for five or more years and who made at least 118 hits last year. These
three regions can be written as \(R_1\) = {X | Years<4.5}, \(R_2\) = {X | Years>=4.5,
Hits<117.5}, and \(R_3\) = {X | Years>=4.5, Hits>=117.5}. Figure 8.2 illustrates
the regions as a function of Years and Hits. The predicted salaries for these
three groups are $1,000 × \(e^{5.107}\) = $165,174, $1,000 × \(e^{5.999}\) = $402,834, and
$1,000 × \(e^{6.740}\) = $845,346 respectively.

¹ Both Years and Hits are integers in these data; the tree() function in R labels
the splits at the midpoint between two adjacent values.

FIGURE 8.2. The three-region partition for the Hitters data set from the
regression tree illustrated in Figure 8.1.
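The three-leaf tree of Figure 8.1 can be written out directly as a pair of nested comparisons. The short Python sketch below is only an illustration (the book's labs use R, and the names `LEAF_MEANS`, `region`, and `predicted_salary` are ours). It uses the rounded leaf means quoted in the text, so the dollar amounts for the second and third regions may differ slightly from the figures above, which presumably were computed from unrounded means.

```python
import math

# Mean log salaries of the three leaves, as quoted in the text (rounded).
LEAF_MEANS = {"R1": 5.107, "R2": 5.999, "R3": 6.740}


def region(years, hits):
    """Return the leaf of the Figure 8.1 tree that a player falls into."""
    if years < 4.5:
        return "R1"
    return "R2" if hits < 117.5 else "R3"


def predicted_salary(years, hits):
    """Predicted salary in dollars: 1,000 * exp(mean log salary of the leaf)."""
    return 1000 * math.exp(LEAF_MEANS[region(years, hits)])


for years, hits in [(3, 150), (6, 100), (6, 120)]:
    print(years, hits, region(years, hits), round(predicted_salary(years, hits)))
```

Note that Hits plays no role for a player with Years<4.5: the first comparison alone determines the prediction.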

In keeping with the tree analogy, the regions \(R_1\), \(R_2\), and \(R_3\) are known
as terminal nodes or leaves of the tree. As is the case for Figure 8.1, decision
trees are typically drawn upside down, in the sense that the leaves are at
the bottom of the tree. The points along the tree where the predictor space
is split are referred to as internal nodes. In Figure 8.1, the two internal
nodes are indicated by the text Years<4.5 and Hits<117.5. We refer to the
segments of the trees that connect the nodes as branches.

We might interpret the regression tree displayed in Figure 8.1 as follows: 
Years is the most important factor in determining Salary, and players with 
less experience earn lower salaries than more experienced players. Given 
that a player is less experienced, the number of hits that he made in the 
previous year seems to play little role in his salary. But among players who
have been in the major leagues for five or more years, the number of hits
made in the previous year does affect salary, and players who made more 
hits last year tend to have higher salaries. The regression tree shown in 
Figure 8.1 is likely an over-simplification of the true relationship between 
Hits, Years, and Salary. However, it has advantages over other types of 
regression models (such as those seen in Chapters 3 and 6): it is easier to 
interpret, and has a nice graphical representation. 

Prediction via Stratification of the Feature Space 

We now discuss the process of building a regression tree. Roughly speaking, 
there are two steps. 

1. We divide the predictor space (that is, the set of possible values for
\(X_1, X_2, \ldots, X_p\)) into J distinct and non-overlapping regions,
\(R_1, R_2, \ldots, R_J\).

2. For every observation that falls into the region Rj, we make the same 
prediction, which is simply the mean of the response values for the 
training observations in Rj. 

For instance, suppose that in Step 1 we obtain two regions, \(R_1\) and \(R_2\),
and that the response mean of the training observations in the first region
is 10, while the response mean of the training observations in the second
region is 20. Then for a given observation X = x, if \(x \in R_1\) we will predict
a value of 10, and if \(x \in R_2\) we will predict a value of 20.

We now elaborate on Step 1 above. How do we construct the regions
\(R_1, \ldots, R_J\)? In theory, the regions could have any shape. However, we
choose to divide the predictor space into high-dimensional rectangles, or
boxes, for simplicity and for ease of interpretation of the resulting
predictive model. The goal is to find boxes \(R_1, \ldots, R_J\) that minimize the
RSS, given by

\[ \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2, \tag{8.1} \]

where \(\hat{y}_{R_j}\) is the mean response for the training observations within the
jth box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into J boxes. For this reason, we take 
a top-down, greedy approach that is known as recursive binary splitting. The 
approach is top-down because it begins at the top of the tree (at which point 
all observations belong to a single region) and then successively splits the 
predictor space; each split is indicated via two new branches further down 
on the tree. It is greedy because at each step of the tree-building process, 
the best split is made at that particular step, rather than looking ahead 
and picking a split that will lead to a better tree in some future step. 
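To make the criterion (8.1) concrete, the following Python sketch computes the RSS of a given partition. It is an illustration only: the function name `rss` and the toy responses are ours, chosen so that the two region means are 10 and 20 as in the example above.

```python
def rss(regions):
    """Equation (8.1): squared deviations from the region mean, summed over regions.

    `regions` is a list of lists; each inner list holds the responses y_i of
    the training observations that fall into one box R_j.
    """
    total = 0.0
    for ys in regions:
        region_mean = sum(ys) / len(ys)      # the prediction made in this box
        total += sum((y - region_mean) ** 2 for y in ys)
    return total


one_box = [[9.0, 11.0, 19.0, 21.0]]          # no split: a single region
two_boxes = [[9.0, 11.0], [19.0, 21.0]]      # region means 10 and 20

print(rss(one_box), rss(two_boxes))
```

Splitting the four observations into the two boxes reduces the RSS from 104 to 4, which is exactly the kind of reduction that recursive binary splitting looks for at each step.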


In order to perform recursive binary splitting, we first select the
predictor \(X_j\) and the cutpoint s such that splitting the predictor space into
the regions \(\{X \mid X_j < s\}\) and \(\{X \mid X_j \ge s\}\) leads to the greatest possible
reduction in RSS. (The notation \(\{X \mid X_j < s\}\) means the region of predictor
space in which \(X_j\) takes on a value less than s.) That is, we consider all
predictors \(X_1, \ldots, X_p\), and all possible values of the cutpoint s for each of
the predictors, and then choose the predictor and cutpoint such that the
resulting tree has the lowest RSS. In greater detail, for any j and s, we
define the pair of half-planes

\[ R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \ge s\}, \tag{8.2} \]

and we seek the value of j and s that minimize the equation

\[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2, \tag{8.3} \]

where \(\hat{y}_{R_1}\) is the mean response for the training observations in \(R_1(j, s)\),
and \(\hat{y}_{R_2}\) is the mean response for the training observations in \(R_2(j, s)\).
Finding the values of j and s that minimize (8.3) can be done quite quickly, 
especially when the number of features p is not too large. 
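The search over j and s in (8.3) can be sketched as an exhaustive loop. The Python code below is a minimal illustration under our own assumptions: the names `sse` and `best_split` and the toy data are hypothetical, and candidate cutpoints are taken to be the midpoints between adjacent distinct values of each predictor.

```python
def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    region_mean = sum(ys) / len(ys)
    return sum((y - region_mean) ** 2 for y in ys)


def best_split(X, y):
    """Return (j, s, rss) minimizing (8.3) over predictors j and cutpoints s.

    X is a list of rows, one list of predictor values per observation.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # A midpoint between adjacent values never leaves a side empty.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data: the response jumps when the first predictor exceeds 6.5.
X = [[1, 5], [2, 3], [3, 8], [10, 4], [11, 7], [12, 6]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
j, s, total = best_split(X, y)
print(j, s, round(total, 2))
```

On this toy data the search selects the first predictor (index 0) with cutpoint 6.5, since that split separates the low responses from the high ones.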

Next, we repeat the process, looking for the best predictor and best 
cutpoint in order to split the data further so as to minimize the RSS within 
each of the resulting regions. However, this time, instead of splitting the 
entire predictor space, we split one of the two previously identified regions. 
We now have three regions. Again, we look to split one of these three regions 
further, so as to minimize the RSS. The process continues until a stopping 
criterion is reached; for instance, we may continue until no region contains 
more than five observations. 

Once the regions \(R_1, \ldots, R_J\) have been created, we predict the response
for a given test observation using the mean of the training observations in
the region to which that test observation belongs.
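The whole procedure, splitting until a stopping criterion is met and then predicting with region means, can be sketched recursively. The Python code below is an illustration under our own assumptions (the names `grow` and `predict`, the dictionary representation, and the toy data are hypothetical). It splits each region independently by recursion; because the best split within a region depends only on that region's data and the stopping rule depends only on region size, this yields the same regions as repeatedly choosing one region to split.

```python
def mean(ys):
    return sum(ys) / len(ys)


def sse(ys):
    m = mean(ys)
    return sum((v - m) ** 2 for v in ys)


def grow(X, y, min_size=5):
    """Split recursively until a region holds at most `min_size` observations.

    Leaves are {"leaf": mean response}; internal nodes store the predictor
    index j, the cutpoint s, and the two child subtrees.
    """
    if len(y) <= min_size:
        return {"leaf": mean(y)}
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [i for i, row in enumerate(X) if row[j] < s]
            right = [i for i, row in enumerate(X) if row[j] >= s]
            total = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or total < best[0]:
                best = (total, j, s, left, right)
    if best is None:                 # all rows identical: nothing to split on
        return {"leaf": mean(y)}
    _, j, s, left, right = best
    return {
        "j": j,
        "s": s,
        "left": grow([X[i] for i in left], [y[i] for i in left], min_size),
        "right": grow([X[i] for i in right], [y[i] for i in right], min_size),
    }


def predict(tree, x):
    """Follow the splits down to a leaf and return its mean response."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["j"]] < tree["s"] else tree["right"]
    return tree["leaf"]


X = [[i] for i in range(12)]
y = [1.0] * 6 + [3.0] * 6
tree = grow(X, y, min_size=6)
print(tree["s"], predict(tree, [2]), predict(tree, [9]))
```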

A five-region example of this approach is shown in Figure 8.3. 

Tree Pruning 

The process described above may produce good predictions on the training 
set, but is likely to overfit the data, leading to poor test set performance. 
This is because the resulting tree might be too complex. A smaller tree 
with fewer splits (that is, fewer regions \(R_1, \ldots, R_J\)) might lead to lower
variance and better interpretation at the cost of a little bias. One possible
alternative to the process described above is to build the tree only so long
as the decrease in the RSS due to each split exceeds some (high) threshold.
This strategy will result in smaller trees, but is too short-sighted since a
seemingly worthless split early on in the tree might be followed by a very
good split, that is, a split that leads to a large reduction in RSS later on.

FIGURE 8.3. Top Left: A partition of two-dimensional feature space that could
not result from recursive binary splitting. Top Right: The output of recursive 
binary splitting on a two-dimensional example. Bottom Left: A tree corresponding 
to the partition in the top right panel. Bottom Right: A perspective plot of the 
prediction surface corresponding to that tree. 


Therefore, a better strategy is to grow a very large tree \(T_0\), and then
prune it back in order to obtain a subtree. How do we determine the best 
way to prune the tree? Intuitively, our goal is to select a subtree that 
leads to the lowest test error rate. Given a subtree, we can estimate its 
test error using cross-validation or the validation set approach. However, 
estimating the cross-validation error for every possible subtree would be too 
cumbersome, since there is an extremely large number of possible subtrees. 
Instead, we need a way to select a small set of subtrees for consideration. 

Cost complexity pruning, also known as weakest link pruning, gives us
a way to do just this. Rather than considering every possible subtree, we
consider a sequence of trees indexed by a nonnegative tuning parameter α.


Algorithm 8.1 Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training
data, stopping only when each terminal node has fewer than some
minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a
sequence of best subtrees, as a function of α.

3. Use K-fold cross-validation to choose α. That is, divide the training
observations into K folds. For each k = 1, ..., K:

(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.

(b) Evaluate the mean squared prediction error on the data in the
left-out kth fold, as a function of α.

Average the results for each value of α, and pick α to minimize the
average error.

4. Return the subtree from Step 2 that corresponds to the chosen value
of α.


For each value of α there corresponds a subtree \(T \subset T_0\) such that

\[ \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \tag{8.4} \]

is as small as possible. Here |T| indicates the number of terminal nodes
of the tree T, \(R_m\) is the rectangle (i.e. the subset of predictor space)
corresponding to the mth terminal node, and \(\hat{y}_{R_m}\) is the predicted response
associated with \(R_m\), that is, the mean of the training observations in \(R_m\).
The tuning parameter α controls a trade-off between the subtree’s
complexity and its fit to the training data. When α = 0, then the subtree T
will simply equal \(T_0\), because then (8.4) just measures the training error.
However, as α increases, there is a price to pay for having a tree with
many terminal nodes, and so the quantity (8.4) will tend to be minimized
for a smaller subtree. Equation 8.4 is reminiscent of the lasso (6.7) from
Chapter 6, in which a similar formulation was used in order to control the
complexity of a linear model.
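The effect of α in (8.4) can be seen on a small made-up example. The Python sketch below assumes that a handful of candidate subtrees has already been summarized by their training RSS and number of leaves; the numbers and the names `cost_complexity` and `best_subtree` are hypothetical. Actual cost complexity pruning finds the minimizing subtree of \(T_0\) itself, which this sketch does not attempt.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Criterion (8.4): training RSS plus alpha times the number of leaves."""
    return rss + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Return the (rss, n_leaves) pair with the smallest penalized criterion."""
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], alpha))


# Hypothetical nested subtrees of one large tree: (training RSS, number of leaves).
subtrees = [(100.0, 1), (40.0, 2), (25.0, 3), (22.0, 5), (20.0, 8)]

for alpha in (0, 2, 100):
    print(alpha, best_subtree(subtrees, alpha))
```

With α = 0 the largest tree wins because only training error counts; at α = 2 the three-leaf subtree has the smallest penalized value (31); at α = 100 the penalty dominates and the single-leaf tree is selected.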

It turns out that as we increase α from zero in (8.4), branches get pruned
from the tree in a nested and predictable fashion, so obtaining the whole
sequence of subtrees as a function of α is easy. We can select a value of
α using a validation set or using cross-validation. We then return to the
full data set and obtain the subtree corresponding to α. This process is
summarized in Algorithm 8.1.

FIGURE 8.4. Regression tree analysis for the Hitters data. The unpruned tree
that results from top-down greedy splitting on the training data is shown. 


Figures 8.4 and 8.5 display the results of fitting and pruning a regression
tree on the Hitters data, using nine of the features. First, we randomly
divided the data set in half, yielding 132 observations in the training set
and 131 observations in the test set. We then built a large regression tree
on the training data and varied α in (8.4) in order to create subtrees with
different numbers of terminal nodes. Finally, we performed six-fold
cross-validation in order to estimate the cross-validated MSE of the trees as
a function of α. (We chose to perform six-fold cross-validation because
132 is an exact multiple of six.) The unpruned regression tree is shown
in Figure 8.4. The green curve in Figure 8.5 shows the CV error as a
function of the number of leaves,² while the orange curve indicates the
test error. Also shown are standard error bars around the estimated errors.
For reference, the training error curve is shown in black. The CV error
is a reasonable approximation of the test error: the CV error takes on its
minimum for a three-node tree, while the test error also dips down at the
three-node tree (though it takes on its lowest value at the ten-node tree).
The pruned tree containing three terminal nodes is shown in Figure 8.1.

² Although CV error is computed as a function of α, it is convenient to display the
result as a function of |T|, the number of leaves; this is based on the relationship between
α and |T| in the original tree grown to all the training data.

FIGURE 8.5. Regression tree analysis for the Hitters data. The training,
cross-validation, and test MSE are shown as a function of the number of
terminal nodes in the pruned tree. Standard error bands are displayed. The minimum
cross-validation error occurs at a tree size of three.


8.1.2 Classification Trees 

A classification tree is very similar to a regression tree, except that it is 
used to predict a qualitative response rather than a quantitative one.
Recall that for a regression tree, the predicted response for an observation is
given by the mean response of the training observations that belong to the 
same terminal node. In contrast, for a classification tree, we predict that 
each observation belongs to the most commonly occurring class of training 
observations in the region to which it belongs. In interpreting the results of 
a classification tree, we are often interested not only in the class prediction 
corresponding to a particular terminal node region, but also in the class 
proportions among the training observations that fall into that region. 

The task of growing a classification tree is quite similar to the task of 
growing a regression tree. Just as in the regression setting, we use recursive 
binary splitting to grow a classification tree. However, in the classification 
setting, RSS cannot be used as a criterion for making the binary splits. 
A natural alternative to RSS is the classification error rate. Since we plan 
to assign an observation in a given region to the most commonly occurring 
class of training observations in that region, the classification error rate is 
simply the fraction of the training observations in that region that do not 
belong to the most common class: 


\[ E = 1 - \max_k (\hat{p}_{mk}). \tag{8.5} \]

Here \(\hat{p}_{mk}\) represents the proportion of training observations in the mth
region that are from the kth class. However, it turns out that classification
error is not sufficiently sensitive for tree-growing, and in practice two other
measures are preferable.

The Gini index is defined by

\[ G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}), \tag{8.6} \]

a measure of total variance across the K classes. It is not hard to see
that the Gini index takes on a small value if all of the \(\hat{p}_{mk}\)’s are close to
zero or one. For this reason the Gini index is referred to as a measure of
node purity: a small value indicates that a node contains predominantly
observations from a single class.

An alternative to the Gini index is cross-entropy, given by

\[ D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}. \tag{8.7} \]

Since \(0 \le \hat{p}_{mk} \le 1\), it follows that \(0 \le -\hat{p}_{mk} \log \hat{p}_{mk}\). One can show that
the cross-entropy will take on a value near zero if the \(\hat{p}_{mk}\)’s are all near
zero or near one. Therefore, like the Gini index, the cross-entropy will take
on a small value if the mth node is pure. In fact, it turns out that the Gini
index and the cross-entropy are quite similar numerically.

When building a classification tree, either the Gini index or the
cross-entropy is typically used to evaluate the quality of a particular split,
since these two approaches are more sensitive to node purity than is the
classification error rate. Any of these three approaches might be used when
pruning the tree, but the classification error rate is preferable if prediction
accuracy of the final pruned tree is the goal.
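The three node-level measures (8.5), (8.6), and (8.7) are easy to compute from a vector of class proportions. The Python sketch below is an illustration only; the function names are ours, the logarithm is taken to be natural, and 0 log 0 is treated as 0 by convention.

```python
import math


def class_error(p):
    """Classification error rate (8.5): one minus the largest class proportion."""
    return 1 - max(p)


def gini(p):
    """Gini index (8.6)."""
    return sum(pk * (1 - pk) for pk in p)


def cross_entropy(p):
    """Cross-entropy (8.7); terms with zero proportion are skipped (0 log 0 = 0)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, class_error(p), round(gini(p), 3), round(cross_entropy(p), 3))
```

All three measures are zero for a pure node and largest for the evenly mixed node, which is the sense in which they measure node purity.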

Figure 8.6 shows an example on the Heart data set. These data
contain a binary outcome HD for 303 patients who presented with chest pain.
An outcome value of Yes indicates the presence of heart disease based on
an angiographic test, while No means no heart disease. There are 13
predictors including Age, Sex, Chol (a cholesterol measurement), and other heart
and lung function measurements. Cross-validation results in a tree with six
terminal nodes.

In our discussion thus far, we have assumed that the predictor
variables take on continuous values. However, decision trees can be constructed
even in the presence of qualitative predictor variables. For instance, in the
Heart data, some of the predictors, such as Sex, Thal (Thallium stress test),
and ChestPain, are qualitative. Therefore, a split on one of these variables
amounts to assigning some of the qualitative values to one branch and
assigning the remaining to the other branch. In Figure 8.6, some of the
internal nodes correspond to splitting qualitative variables. For instance, the
top internal node corresponds to splitting Thal. The text Thal:a indicates
that the left-hand branch coming out of that node consists of observations
with the first value of the Thal variable (normal), and the right-hand node
consists of the remaining observations (fixed or reversible defects). The text
ChestPain:bc two splits down the tree on the left indicates that the left-hand
branch coming out of that node consists of observations with the second
and third values of the ChestPain variable, where the possible values are
typical angina, atypical angina, non-anginal pain, and asymptomatic.

FIGURE 8.6. Heart data. Top: The unpruned tree. Bottom Left:
Cross-validation error, training, and test error, for different sizes of the pruned tree.
Bottom Right: The pruned tree corresponding to the minimal cross-validation
error.

Figure 8.6 has a surprising characteristic: some of the splits yield two
terminal nodes that have the same predicted value. For instance, consider
the split RestECG<1 near the bottom right of the unpruned tree. Regardless
of the value of RestECG, a response value of Yes is predicted for those
observations. Why, then, is the split performed at all? The split is performed
because it leads to increased node purity. That is, all 9 of the observations
corresponding to the right-hand leaf have a response value of Yes, whereas
7/11 of those corresponding to the left-hand leaf have a response value of
Yes. Why is node purity important? Suppose that we have a test
observation that belongs to the region given by that right-hand leaf. Then we
can be pretty certain that its response value is Yes. In contrast, if a test
observation belongs to the region given by the left-hand leaf, then its
response value is probably Yes, but we are much less certain. Even though
the split RestECG<1 does not reduce the classification error, it improves the
Gini index and the cross-entropy, which are more sensitive to node purity.
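The RestECG example can be checked with a little arithmetic. The Python sketch below uses only the counts given in the text (7 of 11 Yes in the left-hand leaf, 9 of 9 Yes in the right-hand leaf) and assumes the parent node contains exactly those 20 observations; the function names are ours. Impurity after the split is the average of the two leaves weighted by their sizes.

```python
def error(yes, n):
    """Two-class classification error rate of a node with `yes` of `n` in class Yes."""
    p = yes / n
    return 1 - max(p, 1 - p)


def gini(yes, n):
    """Two-class Gini index: p(1 - p) + (1 - p)p."""
    p = yes / n
    return 2 * p * (1 - p)


def weighted(measure, nodes):
    """Size-weighted average of a node measure over a list of (yes, n) nodes."""
    total = sum(n for _, n in nodes)
    return sum(n / total * measure(yes, n) for yes, n in nodes)


parent = (16, 20)                 # 7 + 9 Yes out of 11 + 9 observations
leaves = [(7, 11), (9, 9)]        # left-hand and right-hand leaves

print("error:", round(error(*parent), 4), "->", round(weighted(error, leaves), 4))
print("gini: ", round(gini(*parent), 4), "->", round(weighted(gini, leaves), 4))
```

The error rate stays at 4/20 = 0.2 before and after the split, while the Gini index falls from 0.32 to about 0.25, which is why a purity-based criterion makes this split even though the predicted class does not change.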

8.1.3 Trees Versus Linear Models 

Regression and classification trees have a very different flavor from the more 
classical approaches for regression and classification presented in Chapters 3 
and 4. In particular, linear regression assumes a model of the form 

\[ f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \tag{8.8} \]

whereas regression trees assume a model of the form

\[ f(X) = \sum_{m=1}^{M} c_m \cdot 1_{(X \in R_m)}, \tag{8.9} \]

where \(R_1, \ldots, R_M\) represent a partition of feature space, as in Figure 8.3.

Which model is better? It depends on the problem at hand. If the 
relationship between the features and the response is well approximated 
by a linear model as in (8.8), then an approach such as linear regression 
will likely work well, and will outperform a method such as a regression 
tree that does not exploit this linear structure. If instead there is a highly 
non-linear and complex relationship between the features and the response 
as indicated by model (8.9), then decision trees may outperform classical 
approaches. An illustrative example is displayed in Figure 8.7. The relative
performances of tree-based and classical approaches can be assessed by
estimating the test error, using either cross-validation or the validation set 
approach (Chapter 5). 

Of course, other considerations beyond simply test error may come into 
play in selecting a statistical learning method; for instance, in certain
settings, prediction using a tree may be preferred for the sake of
interpretability and visualization.

FIGURE 8.7. Top Row: A two-dimensional classification example in which
the true decision boundary is linear, and is indicated by the shaded regions.
A classical approach that assumes a linear boundary (left) will outperform a
decision tree that performs splits parallel to the axes (right). Bottom Row: Here the
true decision boundary is non-linear. Here a linear model is unable to capture
the true decision boundary (left), whereas a decision tree is successful (right).


8.1.4 Advantages and Disadvantages of Trees 

Decision trees for regression and classification have a number of advantages 
over the more classical approaches seen in Chapters 3 and 4: 

▲ Trees are very easy to explain to people. In fact, they are even easier 
to explain than linear regression! 

▲ Some people believe that decision trees more closely mirror human
decision-making than do the regression and classification approaches
seen in previous chapters.

▲ Trees can be displayed graphically, and are easily interpreted even by
a non-expert (especially if they are small).

▲ Trees can easily handle qualitative predictors without the need to
create dummy variables.

▼ Unfortunately, trees generally do not have the same level of predictive
accuracy as some of the other regression and classification approaches
seen in this book.

However, by aggregating many decision trees, using methods like bagging,
random forests, and boosting, the predictive performance of trees can be
substantially improved. We introduce these concepts in the next section.

8.2 Bagging, Random Forests, Boosting 

Bagging, random forests, and boosting use trees as building blocks to 
construct more powerful prediction models. 

8.2.1 Bagging 

The bootstrap, introduced in Chapter 5, is an extremely powerful idea. It is 
used in many situations in which it is hard or even impossible to directly 
compute the standard deviation of a quantity of interest. We see here that 
the bootstrap can be used in a completely different context, in order to 
improve statistical learning methods such as decision trees. 

The decision trees discussed in Section 8.1 suffer from high variance. 
This means that if we split the training data into two parts at random, 
and fit a decision tree to both halves, the results that we get could be 
quite different. In contrast, a procedure with low variance will yield similar 
results if applied repeatedly to distinct data sets; linear regression tends 
to have low variance, if the ratio of n to p is moderately large. Bootstrap 
aggregation, or bagging, is a general-purpose procedure for reducing the 
variance of a statistical learning method; we introduce it here because it is 
particularly useful and frequently used in the context of decision trees. 

Recall that given a set of n independent observations \(Z_1, \ldots, Z_n\), each
with variance \(\sigma^2\), the variance of the mean \(\bar{Z}\) of the observations is given
by \(\sigma^2/n\). In other words, averaging a set of observations reduces variance.
Hence a natural way to reduce the variance and hence increase the
prediction accuracy of a statistical learning method is to take many training sets
from the population, build a separate prediction model using each training
set, and average the resulting predictions. In other words, we could
calculate \(\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)\) using B separate training sets, and average
them in order to obtain a single low-variance statistical learning model,
given by

\[ \hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x). \]

Of course, this is not practical because we generally do not have access
to multiple training sets. Instead, we can bootstrap, by taking repeated
samples from the (single) training data set. In this approach we generate
B different bootstrapped training data sets. We then train our method on
the bth bootstrapped training set in order to get \(\hat{f}^{*b}(x)\), and finally average
all the predictions, to obtain

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x). \]

This is called bagging.
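Bagging as just described amounts to a short loop: resample, fit, predict, average. The Python sketch below is an illustration under our own assumptions. In particular, `fit_stump` is a hypothetical stand-in for a tree-fitting routine (a single split at the sample median of one predictor), and the toy data are made up.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_predict(data, fit, x, B=100, seed=0):
    """Average the predictions at x of B models, each fit to a bootstrap sample."""
    rng = random.Random(seed)
    preds = []
    for _ in range(B):
        model = fit(bootstrap_sample(data, rng))   # plays the role of f^{*b}
        preds.append(model(x))
    return sum(preds) / B


def fit_stump(data):
    """Hypothetical base learner: one split at the sample median of x."""
    xs = sorted(x for x, _ in data)
    cut = xs[len(xs) // 2]
    # If nothing lies below the cut, fall back to all responses on the left.
    left = [y for x, y in data if x < cut] or [y for _, y in data]
    right = [y for x, y in data if x >= cut]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x < cut else right_mean


# Toy data: the response is 1 for x below 10 and 3 otherwise.
data = [(x, 1.0 if x < 10 else 3.0) for x in range(20)]
low = bagged_predict(data, fit_stump, 2)
high = bagged_predict(data, fit_stump, 17)
print(round(low, 2), round(high, 2))
```

Each individual stump depends on where its bootstrap median happens to fall, but the average over B = 100 stumps is far more stable, which is the variance reduction that bagging is designed to deliver.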

While bagging can improve predictions for many regression methods, 
it is particularly useful for decision trees. To apply bagging to regression 
trees, we simply construct B regression trees using B bootstrapped training 
sets, and average the resulting predictions. These trees are grown deep, 
and are not pruned. Hence each individual tree has high variance, but 
low bias. Averaging these B trees reduces the variance. Bagging has been 
demonstrated to give impressive improvements in accuracy by combining 
together hundreds or even thousands of trees into a single procedure. 

Thus far, we have described the bagging procedure in the regression 
context, to predict a quantitative outcome Y. How can bagging be extended 
to a classification problem where Y is qualitative? In that situation, there 
are a few possible approaches, but the simplest is as follows. For a given test 
observation, we can record the class predicted by each of the B trees, and 
take a majority vote: the overall prediction is the most commonly occurring 
class among the B predictions. 
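The majority vote itself is a one-line counting operation. The Python sketch below is illustrative; the function name is ours, and ties are broken in favor of the class seen first, which is an arbitrary choice not specified in the text.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most commonly occurring class among the B tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


votes = ["Yes", "No", "Yes", "Yes", "No"]   # hypothetical predictions from B = 5 trees
print(majority_vote(votes))
```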

Figure 8.8 shows the results from bagging trees on the Heart data. The 
test error rate is shown as a function of B, the number of trees constructed
using bootstrapped training data sets. We see that the bagging test error 
rate is slightly lower in this case than the test error rate obtained from a 
single tree. The number of trees B is not a critical parameter with bagging; 
using a very large value of B will not lead to overfitting. In practice we 
use a value of B sufficiently large that the error has settled down. Using 
B = 100 is sufficient to achieve good performance in this example. 

Out-of-Bag Error Estimation 

It turns out that there is a very straightforward way to estimate the test
error of a bagged model, without the need to perform cross-validation or
the validation set approach. Recall that the key to bagging is that trees are
repeatedly fit to bootstrapped subsets of the observations. One can show
that on average, each bagged tree makes use of around two-thirds of the
observations.³ The remaining one-third of the observations not used to fit a
given bagged tree are referred to as the out-of-bag (OOB) observations. We
can predict the response for the ith observation using each of the trees in
which that observation was OOB. This will yield around B/3 predictions
for the ith observation. In order to obtain a single prediction for the ith
observation, we can average these predicted responses (if regression is the
goal) or can take a majority vote (if classification is the goal). This leads
to a single OOB prediction for the ith observation. An OOB prediction
can be obtained in this way for each of the n observations, from which the
overall OOB MSE (for a regression problem) or classification error (for a
classification problem) can be computed. The resulting OOB error is a valid
estimate of the test error for the bagged model, since the response for each
observation is predicted using only the trees that were not fit using that
observation. Figure 8.8 displays the OOB error on the Heart data. It can
be shown that with B sufficiently large, OOB error is virtually equivalent
to leave-one-out cross-validation error. The OOB approach for estimating
the test error is particularly convenient when performing bagging on large
data sets for which cross-validation would be computationally onerous.

³ This relates to Exercise 2 of Chapter 5.

FIGURE 8.8. Bagging and random forest results for the Heart data. The test
error (black and orange) is shown as a function of B, the number of bootstrapped
training sets used. Random forests were applied with m = √p. The dashed line
indicates the test error resulting from a single classification tree. The green and
blue traces show the OOB error, which in this case is considerably lower.
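Both the "two-thirds" figure and the OOB bookkeeping can be sketched in a few lines of Python. This is an illustration under our own assumptions: the function names are hypothetical, and to keep the example short the "model" fit to each bootstrap sample simply predicts that sample's mean response, standing in for a bagged tree.

```python
import math
import random


def prob_out_of_bag(n):
    """Chance that a given observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def oob_mse(y, B=200, seed=0):
    """OOB estimate of the MSE when each bagged model predicts its sample mean."""
    rng = random.Random(seed)
    n = len(y)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap indices
        pred = sum(y[i] for i in idx) / n             # model fit to this sample
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:                       # observation i is OOB here
                sums[i] += pred
                counts[i] += 1
    # Average the OOB predictions per observation, then average squared errors.
    errs = [(y[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0]
    return sum(errs) / len(errs)


print(round(prob_out_of_bag(1000), 4), round(math.exp(-1), 4))
print(oob_mse([float(i) for i in range(20)]) > 0)
```

The out-of-bag probability approaches 1/e ≈ 0.368 as n grows, so roughly one-third of the observations are left out of each bootstrap sample and each observation receives about B/3 OOB predictions.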

Variable Importance Measures 

As we have discussed, bagging typically results in improved accuracy over
prediction using a single tree. Unfortunately, however, it can be difficult to
interpret the resulting model. Recall that one of the advantages of decision
trees is the attractive and easily interpreted diagram that results, such as
the one displayed in Figure 8.1. However, when we bag a large number of
trees, it is no longer possible to represent the resulting statistical learning
procedure using a single tree, and it is no longer clear which variables
are most important to the procedure. Thus, bagging improves prediction
accuracy at the expense of interpretability.

FIGURE 8.9. A variable importance plot for the Heart data. Variable
importance is computed using the mean decrease in Gini index, and expressed relative
to the maximum.

Although the collection of bagged trees is much more difficult to interpret 
than a single tree, one can obtain an overall summary of the importance of 
each predictor using the RSS (for bagging regression trees) or the Gini index 
(for bagging classification trees). In the case of bagging regression trees, we 
can record the total amount that the RSS (8.1) is decreased due to splits 
over a given predictor, averaged over all B trees. A large value indicates 
an important predictor. Similarly, in the context of bagging classification 
trees, we can add up the total amount that the Gini index (8.6) is decreased 
by splits over a given predictor, averaged over all B trees. 

A graphical representation of the variable importances in the Heart data 
is shown in Figure 8.9. We see the mean decrease in Gini index for each 
variable, relative to the largest. The variables with the largest mean decrease 
in Gini index are Thal, Ca, and ChestPain. 

8.2.2 Random Forests 

Random forests provide an improvement over bagged trees by way of a 
small tweak that decorrelates the trees. As in bagging, we build a number 
of decision trees on bootstrapped training samples. But when building these 
decision trees, each time a split in a tree is considered, a random sample of 
m predictors is chosen as split candidates from the full set of p predictors. 
The split is allowed to use only one of those m predictors. A fresh sample of 
m predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$; that 
is, the number of predictors considered at each split is approximately equal 
to the square root of the total number of predictors (4 out of the 13 for the 
Heart data). 

In other words, in building a random forest, at each split in the tree, 
the algorithm is not even allowed to consider a majority of the available 
predictors. This may sound crazy, but it has a clever rationale. Suppose 
that there is one very strong predictor in the data set, along with a number 
of other moderately strong predictors. Then in the collection of bagged 
trees, most or all of the trees will use this strong predictor in the top split. 
Consequently, all of the bagged trees will look quite similar to each other. 
Hence the predictions from the bagged trees will be highly correlated. 
Unfortunately, averaging many highly correlated quantities does not lead to 
as large a reduction in variance as averaging many uncorrelated quantities. 
In particular, this means that bagging will not lead to a substantial 
reduction in variance over a single tree in this setting. 
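
A standard result, not derived in this section, makes the point concrete: if B identically distributed quantities each have variance sigma2 and common pairwise correlation rho, their average has variance rho * sigma2 + (1 - rho) * sigma2 / B. The sketch below evaluates this formula; the function name var_of_average and the values of B, sigma2, and rho are hypothetical choices for illustration.

```r
# Variance of the average of B identically distributed quantities, each with
# variance sigma2 and common pairwise correlation rho (all values hypothetical)
var_of_average <- function(B, sigma2, rho) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

B <- 100
var_uncorrelated <- var_of_average(B, sigma2 = 1, rho = 0)    # sigma2 / B
var_correlated   <- var_of_average(B, sigma2 = 1, rho = 0.9)  # dominated by rho * sigma2
```

With rho = 0 the variance shrinks like 1/B, whereas with rho = 0.9 it cannot fall below 0.9 however many trees are averaged.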

Random forests overcome this problem by forcing each split to consider 
only a subset of the predictors. Therefore, on average $(p - m)/p$ of the 
splits will not even consider the strong predictor, and so other predictors 
will have more of a chance. We can think of this process as decorrelating 
the trees, thereby making the average of the resulting trees less variable 
and hence more reliable. 
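
As a quick check of the (p - m)/p figure, the sketch below uses the Heart data values quoted earlier (p = 13, m = 4) and assumes, purely for illustration, that the strong predictor is predictor number 1. The number of simulated splits and the seed are arbitrary.

```r
p <- 13                          # number of predictors (Heart data)
m <- 4                           # predictors sampled at each split
exact <- (p - m) / p             # fraction of splits that cannot use the strong predictor

set.seed(1)                      # arbitrary seed
# Simulate 10,000 splits; predictor 1 plays the role of the strong predictor
excluded <- replicate(10000, !(1 %in% sample(p, m)))
simulated <- mean(excluded)      # should be close to exact, i.e. about 0.69
```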

The main difference between bagging and random forests is the choice 
of predictor subset size m. For instance, if a random forest is built using 
m = p, then this amounts simply to bagging. On the Heart data, random 
forests using $m = \sqrt{p}$ lead to a reduction in both test error and OOB error 
over bagging (Figure 8.8). 

Using a small value of m in building a random forest will typically be 
helpful when we have a large number of correlated predictors. We applied 
random forests to a high-dimensional biological data set consisting of 
expression measurements of 4,718 genes measured on tissue samples from 349 
patients. There are around 20,000 genes in humans, and individual genes 
have different levels of activity, or expression, in particular cells, tissues, 
and biological conditions. In this data set, each of the patient samples has 
a qualitative label with 15 different levels: either normal or 1 of 14 different 
types of cancer. Our goal was to use random forests to predict cancer type 
based on the 500 genes that have the largest variance in the training set. 

FIGURE 8.10. Results from random forests for the 15-class gene expression 
data set with p = 500 predictors. The test error is displayed as a function of 
the number of trees. Each colored line corresponds to a different value of m, the 
number of predictors available for splitting at each interior tree node. Random 
forests (m < p) lead to a slight improvement over bagging (m = p). A single 
classification tree has an error rate of 45.7%. 

We randomly divided the observations into a training and a test set, and 
applied random forests to the training set for three different values of the 
number of splitting variables m. The results are shown in Figure 8.10. The 
error rate of a single tree is 45.7%, and the null rate is 75.4%.⁴ We see that 
using 400 trees is sufficient to give good performance, and that the choice 
$m = \sqrt{p}$ gave a small improvement in test error over bagging (m = p) in 
this example. As with bagging, random forests will not overfit if we increase 
B, so in practice we use a value of B sufficiently large for the error rate to 
have settled down. 

8.2.3 Boosting 

We now discuss boosting, yet another approach for improving the predictions 
resulting from a decision tree. Like bagging, boosting is a general 
approach that can be applied to many statistical learning methods for 
regression or classification. Here we restrict our discussion of boosting to the 
context of decision trees. 

Recall that bagging involves creating multiple copies of the original training 
data set using the bootstrap, fitting a separate decision tree to each 
copy, and then combining all of the trees in order to create a single predictive 
model. Notably, each tree is built on a bootstrap data set, independent 
of the other trees. Boosting works in a similar way, except that the trees are 
grown sequentially: each tree is grown using information from previously 
grown trees. Boosting does not involve bootstrap sampling; instead each 
tree is fit on a modified version of the original data set. 

⁴ The null rate results from simply classifying each observation to the dominant class 
overall, which is in this case the normal class. 


Algorithm 8.2 Boosting for Regression Trees 

1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set. 

2. For $b = 1, 2, \ldots, B$, repeat: 

   (a) Fit a tree $\hat{f}^b$ with $d$ splits ($d + 1$ terminal nodes) to the training 
       data $(X, r)$. 

   (b) Update $\hat{f}$ by adding in a shrunken version of the new tree: 

       $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$.    (8.10) 

   (c) Update the residuals, 

       $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.    (8.11) 

3. Output the boosted model, 

       $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.    (8.12) 


Consider first the regression setting. Like bagging, boosting involves combining 
a large number of decision trees, $\hat{f}^1, \ldots, \hat{f}^B$. Boosting is described 
in Algorithm 8.2. 

What is the idea behind this procedure? Unlike fitting a single large decision 
tree to the data, which amounts to fitting the data hard and potentially 
overfitting, the boosting approach instead learns slowly. Given the current 
model, we fit a decision tree to the residuals from the model. That is, we 
fit a tree using the current residuals, rather than the outcome Y, as the 
response. We then add this new decision tree into the fitted function in order 
to update the residuals. Each of these trees can be rather small, with just 
a few terminal nodes, determined by the parameter d in the algorithm. By 
fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it 
does not perform well. The shrinkage parameter λ slows the process down 
even further, allowing more and different shaped trees to attack the 
residuals. In general, statistical learning approaches that learn slowly tend to 
perform well. Note that in boosting, unlike in bagging, the construction of 
each tree depends strongly on the trees that have already been grown. 
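
The steps of Algorithm 8.2 can be sketched in base R for the simplest case of stumps (d = 1) and a single predictor. This is an illustration under stated assumptions, not the implementation used by any package: fit_stump() and predict_stump() are hypothetical helpers written for this example, and the data, seed, B, and λ are arbitrary.

```r
# Fit a one-split regression tree (stump) to (x, r) by minimizing the RSS
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2      # midpoints between distinct x values
  rss <- sapply(cuts, function(s) {
    left <- r[x < s]
    right <- r[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- cuts[which.min(rss)]
  list(cut = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}

# Predict with a stump: one constant on each side of the cutpoint
predict_stump <- function(stump, x) {
  ifelse(x < stump$cut, stump$left, stump$right)
}

set.seed(1)                                      # arbitrary seed
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)             # simulated data, for illustration only

B <- 500                                         # number of trees (hypothetical value)
lambda <- 0.01                                   # shrinkage parameter (hypothetical value)
fhat <- rep(0, n)                                # step 1: fhat(x) = 0 ...
r <- y                                           # ... and r_i = y_i
for (b in seq_len(B)) {                          # step 2
  stump <- fit_stump(x, r)                       # (a) fit a tree to (X, r)
  shrunken <- lambda * predict_stump(stump, x)
  fhat <- fhat + shrunken                        # (b) update as in (8.10)
  r <- r - shrunken                              # (c) update residuals as in (8.11)
}

train_mse_start <- mean(y^2)                     # training MSE of the initial model fhat = 0
train_mse_end <- mean((y - fhat)^2)              # training MSE of the boosted model (8.12)
```

Here fhat holds the boosted model (8.12) evaluated at the training points. Each pass removes only a fraction λ of what the current stump explains, which is the slow learning described above.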

FIGURE 8.11. Results from performing boosting and random forests on the 
15-class gene expression data set in order to predict cancer versus normal. The 
test error is displayed as a function of the number of trees. For the two boosted 
models, λ = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both 
outperform the random forest, although the standard errors are around 0.02, making 
none of these differences significant. The test error rate for a single tree is 24%. 

We have just described the process of boosting regression trees. Boosting 
classification trees proceeds in a similar but slightly more complex way, and 
the details are omitted here. 

Boosting has three tuning parameters: 

1. The number of trees B. Unlike bagging and random forests, boosting 
can overfit if B is too large, although this overfitting tends to occur 
slowly if at all. We use cross-validation to select B. 

2. The shrinkage parameter λ, a small positive number. This controls the 
rate at which boosting learns. Typical values are 0.01 or 0.001, and 
the right choice can depend on the problem. Very small λ can require 
using a very large value of B in order to achieve good performance. 

3. The number d of splits in each tree, which controls the complexity 
of the boosted ensemble. Often d = 1 works well, in which case each 
tree is a stump, consisting of a single split. In this case, the boosted 
ensemble is fitting an additive model, since each term involves only a 
single variable. More generally d is the interaction depth, and controls 
the interaction order of the boosted model, since d splits can involve 
at most d variables. 

In Figure 8.11, we applied boosting to the 15-class cancer gene expression 
data set, in order to develop a classifier that can distinguish the normal 
class from the 14 cancer classes. We display the test error as a function of 
the total number of trees and the interaction depth d. We see that simple 
stumps with an interaction depth of one perform well if enough of them 
are included. This model outperforms the depth-two model, and both 
outperform a random forest. This highlights one difference between boosting 
and random forests: in boosting, because the growth of a particular tree 
takes into account the other trees that have already been grown, smaller 
trees are typically sufficient. Using smaller trees can aid in interpretability 
as well; for instance, using stumps leads to an additive model. 


8.3 Lab: Decision Trees 

8.3.1 Fitting Classification Trees 

The tree library is used to construct classification and regression trees. 

> library(tree) 

We first use classification trees to analyze the Carseats data set. In these 
data, Sales is a continuous variable, and so we begin by recoding it as a 
binary variable. We use the ifelse() function to create a variable, called 
High, which takes on a value of Yes if the Sales variable exceeds 8, and 
takes on a value of No otherwise. 

> library(ISLR) 

> attach(Carseats) 

> High=factor(ifelse(Sales<=8,"No","Yes"))  # factor, so tree() fits a classification tree 

Finally, we use the data.frame() function to merge High with the rest of 
the Carseats data. 

> Carseats=data.frame(Carseats,High) 

We now use the tree() function to fit a classification tree in order to predict 
High using all variables but Sales. The syntax of the tree() function is quite 
similar to that of the lm() function. 

> tree.carseats=tree(High~.-Sales,Carseats) 

The summary() function lists the variables that are used as internal nodes 
in the tree, the number of terminal nodes, and the (training) error rate. 

> summary(tree.carseats) 

Classification tree: 

tree(formula = High ~ . - Sales, data = Carseats) 

Variables actually used in tree construction: 

[1] "ShelveLoc" "Price" "Income" "CompPrice" 

[5] "Population" "Advertising" "Age" "US" 

Number of terminal nodes : 27 

Residual mean deviance: 0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400 

We see that the training error rate is 9%. For classification trees, the 
deviance reported in the output of summary() is given by 

$$-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk},$$

where $n_{mk}$ is the number of observations in the mth terminal node that 
belong to the kth class. A small deviance indicates a tree that provides 
a good fit to the (training) data. The residual mean deviance reported is 
simply the deviance divided by $n - |T_0|$, which in this case is 400 − 27 = 373. 
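
The deviance formula can be evaluated directly in base R. The sketch below applies it to a tree consisting of the root node alone, using the root class proportions printed for tree.carseats later in this lab (0.590 and 0.410 of 400 observations, that is, 236 No and 164 Yes). The helper node_deviance() is a hypothetical name introduced for this illustration.

```r
# n_mk: matrix of class counts, one row per terminal node m, one column per class k
node_deviance <- function(n_mk) {
  n_mk <- as.matrix(n_mk)
  p_mk <- n_mk / rowSums(n_mk)                     # class proportions within each node
  # -2 * sum over m and k of n_mk * log(p_mk); empty cells contribute zero
  -2 * sum(ifelse(n_mk > 0, n_mk * log(p_mk), 0))
}

root_counts <- matrix(c(236, 164), nrow = 1)       # 400 * 0.590 and 400 * 0.410
root_dev <- node_deviance(root_counts)
round(root_dev, 1)                                 # compare with 541.5 on the root line
```

Dividing such a deviance by n minus the number of terminal nodes gives the residual mean deviance, as in 170.7 / 373 = 0.4575 above.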

One of the most attractive properties of trees is that they can be 
graphically displayed. We use the plot() function to display the tree 
structure, and the text() function to display the node labels. The argument 
pretty=0 instructs R to include the category names for any qualitative 
predictors, rather than simply displaying a letter for each category. 

> plot(tree.carseats) 

> text(tree.carseats,pretty=0) 

The most important indicator of Sales appears to be shelving location, 
since the first branch differentiates Good locations from Bad and Medium 
locations. 

If we just type the name of the tree object, R prints output corresponding 
to each branch of the tree. R displays the split criterion (e.g. Price<92.5), the 
number of observations in that branch, the deviance, the overall prediction 
for the branch (Yes or No), and the fraction of observations in that branch 
that take on values of Yes and No. Branches that lead to terminal nodes are 
indicated using asterisks. 

> tree.carseats 

node), split, n, deviance, yval, (yprob) 

* denotes terminal node 
1) root 400 541.5 No ( 0.590 0.410 ) 

2) ShelveLoc: Bad,Medium 315 390.6 No ( 0.689 0.311 ) 

4) Price < 92.5 46 56.53 Yes ( 0.304 0.696 ) 

8) Income < 57 10 12.22 No ( 0.700 0.300 ) 

In order to properly evaluate the performance of a classification tree on 
these data, we must estimate the test error rather than simply computing 
the training error. We split the observations into a training set and a test 
set, build the tree using the training set, and evaluate its performance on 
the test data. The predict() function can be used for this purpose. In the 
case of a classification tree, the argument type="class" instructs R to return 
the actual class prediction. This approach leads to correct predictions for 
around 71.5% of the locations in the test data set. 

> set.seed(2) 
> train=sample(1:nrow(Carseats),200) 
> Carseats.test=Carseats[-train,] 
> High.test=High[-train] 
> tree.carseats=tree(High~.-Sales,Carseats,subset=train) 
> tree.pred=predict(tree.carseats,Carseats.test,type="class") 
> table(tree.pred,High.test) 
         High.test 
tree.pred No Yes 
      No  86  27 
      Yes 30  57 
> (86+57)/200 
[1] 0.715 
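
Rather than typing the diagonal entries by hand, the accuracy can be computed from the confusion matrix itself. The sketch below rebuilds the table above as a plain matrix so that it runs without the tree package; in the lab one could presumably apply the same two lines to the object returned by table().

```r
# Confusion matrix from the output above: rows are predictions, columns are truth
conf <- matrix(c(86, 30, 27, 57), nrow = 2,
               dimnames = list(tree.pred = c("No", "Yes"),
                               High.test = c("No", "Yes")))

accuracy <- sum(diag(conf)) / sum(conf)   # correct predictions over all predictions
test_error <- 1 - accuracy                # test error rate
```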

Next, we consider whether pruning the tree might lead to improved 
results. The function cv.tree() performs cross-validation in order to 
determine the optimal level of tree complexity; cost complexity pruning 
is used in order to select a sequence of trees for consideration. We use 
the argument FUN=prune.misclass in order to indicate that we want the 
classification error rate to guide the cross-validation and pruning process, 
rather than the default for the cv.tree() function, which is deviance. The 
cv.tree() function reports the number of terminal nodes of each tree 
considered (size) as well as the corresponding error rate and the value of the 
cost-complexity parameter used (k, which corresponds to α in (8.4)). 

> set.seed(3) 

> cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass) 

> names(cv.carseats) 

[1] "size" "dev" "k" "method" 

> cv.carseats 
$size 

[1] 19 17 14 13 9 7 3 2 1 

$dev 

[1] 55 55 53 52 50 56 69 65 80 

$k 

[1]       -Inf  0.0000000  0.6666667  1.0000000  1.7500000  2.0000000  4.2500000 

[8]  5.0000000 23.0000000 

$method 

[1] "misclass" 
attr(,"class") 

[1] "prune" "tree.sequence" 

Note that, despite the name, dev corresponds to the cross-validation error 
rate in this instance. The tree with 9 terminal nodes results in the lowest 
cross-validation error rate, with 50 cross-validation errors. We plot the error 
rate as a function of both size and k. 
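
The best tree size can also be read off programmatically. The sketch below copies the $size and $dev vectors from the output above into plain vectors so that it is self-contained; with the fitted object available, the same indexing should work on cv.carseats$size and cv.carseats$dev.

```r
# Values copied from the cv.carseats output above
size <- c(19, 17, 14, 13, 9, 7, 3, 2, 1)
dev  <- c(55, 55, 53, 52, 50, 56, 69, 65, 80)

best_size <- size[which.min(dev)]   # number of terminal nodes with the fewest CV errors
min_errors <- min(dev)              # corresponding number of cross-validation errors
```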

> par(mfrow=c(1,2)) 

> plot(cv.carseats$size,cv.carseats$dev,type="b") 

> plot(cv.carseats$k,cv.carseats$dev,type="b") 


We now apply the prune.misclass() function in order to prune the tree to 
obtain the nine-node tree. 

> prune.carseats=prune.misclass(tree.carseats,best=9) 

> plot(prune.carseats) 

> text(prune.carseats,pretty=0) 

How well does this pruned tree perform on the test data set? Once again, 
we apply the predict() function. 

> tree.pred=predict(prune.carseats,Carseats.test,type="class") 
> table(tree.pred,High.test) 
         High.test 
tree.pred No Yes 
      No  94  24 
      Yes 22  60 
> (94+60)/200 
[1] 0.77 

Now 77% of the test observations are correctly classified, so not only has 
the pruning process produced a more interpretable tree, but it has also 
improved the classification accuracy. 

If we increase the value of best, we obtain a larger pruned tree with lower 
classification accuracy: 

> prune.carseats=prune.misclass(tree.carseats,best=15) 
> plot(prune.carseats) 
> text(prune.carseats,pretty=0) 
> tree.pred=predict(prune.carseats,Carseats.test,type="class") 
> table(tree.pred,High.test) 
         High.test 
tree.pred No Yes 
      No  86  22 
      Yes 30  62 
> (86+62)/200 
[1] 0.74 


8.3.2 Fitting Regression Trees 

Here we fit a regression tree to the Boston data set. First, we create a 
training set, and fit the tree to the training data. 

> library(MASS) 
> set.seed(1) 
> train=sample(1:nrow(Boston),nrow(Boston)/2) 
> tree.boston=tree(medv~.,Boston,subset=train) 
> summary(tree.boston) 

Regression tree: 
tree(formula = medv ~ ., data = Boston, subset = train) 
Variables actually used in tree construction: 
[1] "lstat" "rm"    "dis" 
Number of terminal nodes:  8 
Residual mean deviance:  12.65 = 3099 / 245 
Distribution of residuals: 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-14.1000   -2.0420   -0.0536    0.0000    1.9600   12.6000 

Notice that the output of summary() indicates that only three of the variables 
have been used in constructing the tree. In the context of a regression 
tree, the deviance is simply the sum of squared errors for the tree. We now 
plot the tree. 

> plot(tree.boston) 

> text(tree.boston,pretty=0) 

The variable lstat measures the percentage of individuals with lower 
socioeconomic status. The tree indicates that lower values of lstat 
correspond to more expensive houses. The tree predicts a median house price 
of $46,400 for larger homes in suburbs in which residents have high 
socioeconomic status (rm>=7.437 and lstat<9.715). 

Now we use the cv.tree() function to see whether pruning the tree will 
improve performance. 

> cv.boston=cv.tree(tree.boston) 

> plot(cv.boston$size,cv.boston$dev,type='b') 

In this case, the most complex tree is selected by cross-validation. However, 
if we wish to prune the tree, we could do so as follows, using the 
prune.tree() function: 

> prune.boston=prune.tree(tree.boston,best=5) 

> plot(prune.boston) 

> text(prune.boston,pretty=0) 

In keeping with the cross-validation results, we use the unpruned tree to 
make predictions on the test set. 

> yhat=predict(tree.boston,newdata=Boston[-train,]) 

> boston.test=Boston[-train,"medv"] 

> plot(yhat,boston.test) 

> abline(0,1) 

> mean((yhat-boston.test)^2) 

[1] 25.05 

In other words, the test set MSE associated with the regression tree is 
25.05. The square root of the MSE is therefore around 5.005, indicating 
that this model leads to test predictions that are within around $5,005 of 
the true median home value for the suburb. 
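
The conversion from test MSE to a dollar figure is a one-line calculation. The sketch below starts from the MSE reported above and relies on the fact that medv in the Boston data is recorded in thousands of dollars.

```r
test_mse <- 25.05               # test set MSE reported above
rmse <- sqrt(test_mse)          # root mean squared error, in units of medv ($1000s)
rmse_dollars <- 1000 * rmse     # the same quantity expressed in dollars
```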


8.3.3 Bagging and Random Forests 

Here we apply bagging and random forests to the Boston data, using the 
randomForest package in R. The exact results obtained in this section may 
depend on the version of R and the version of the randomForest package 
installed on your computer. Recall that bagging is simply a special case of 
a random forest with m = p. Therefore, the randomForest() function can 
be used to perform both random forests and bagging. We perform bagging 
as follows: 

> library(randomForest) 
> set.seed(1) 
> bag.boston=randomForest(medv~.,data=Boston,subset=train, 
    mtry=13,importance=TRUE) 
> bag.boston 

Call: 
 randomForest(formula = medv ~ ., data = Boston, mtry = 13, 
     importance = TRUE, subset = train) 
               Type of random forest: regression 
                     Number of trees: 500 
No. of variables tried at each split: 13 

          Mean of squared residuals: 10.77 
                    % Var explained: 86.96 


The argument mtry=13 indicates that all 13 predictors should be considered 
for each split of the tree—in other words, that bagging should be done. How 
well does this bagged model perform on the test set? 

> yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 

> plot(yhat.bag,boston.test) 

> abline(0,1) 

> mean((yhat.bag-boston.test)^2) 

[1] 13.16 

The test set MSE associated with the bagged regression tree is 13.16, almost 
half that obtained using an optimally-pruned single tree. We could change 
the number of trees grown by randomForest() using the ntree argument: 

> bag.boston=randomForest(medv~.,data=Boston,subset=train, 
    mtry=13,ntree=25) 

> yhat.bag=predict(bag.boston,newdata=Boston[-train,]) 

> mean((yhat.bag-boston.test)^2) 

[1] 13.31 

Growing a random forest proceeds in exactly the same way, except that 
we use a smaller value of the mtry argument. By default, randomForest() 
uses p/3 variables when building a random forest of regression trees, and 
$\sqrt{p}$ variables when building a random forest of classification trees. Here we 
use mtry = 6. 

> set.seed(1) 

> rf.boston=randomForest(medv~.,data=Boston,subset=train, 
    mtry=6,importance=TRUE) 

> yhat.rf=predict(rf.boston,newdata=Boston[-train,]) 

> mean((yhat.rf-boston.test)^2) 

[1] 11.31 

The test set MSE is 11.31; this indicates that random forests yielded an 
improvement over bagging in this case. 

Using the importance() function, we can view the importance of each 
variable. 


> importance(rf.boston) 
        %IncMSE IncNodePurity 
crim     12.384       1051.54 
zn        2.103         50.31 
indus     8.390       1017.64 
chas      2.294         56.32 
nox      12.791       1107.31 
rm       30.754       5917.26 
age      10.334        552.27 
dis      14.641       1223.93 
rad       3.583         84.30 
tax       8.139        435.71 
ptratio  11.274        817.33 
black     8.097        367.00 
lstat    30.962       7713.63 


Two measures of variable importance are reported. The former is based 
upon the mean decrease of accuracy in predictions on the out of bag samples 
when a given variable is excluded from the model. The latter is a measure 
of the total decrease in node impurity that results from splits over that 
variable, averaged over all trees (this was plotted in Figure 8.9). In the 
case of regression trees, the node impurity is measured by the training 
RSS, and for classification trees by the deviance. Plots of these importance 
measures can be produced using the varImpPlot() function. 

> varImpPlot(rf.boston) 

The results indicate that across all of the trees considered in the random 
forest, the wealth level of the community (lstat) and the house size (rm) 
are by far the two most important variables. 
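
To compare predictors on a common scale, importance values can be expressed relative to the largest, as was done for Figure 8.9. The sketch below applies that rescaling to the IncNodePurity column printed above; the values are copied into a named vector so that the example runs without the randomForest package.

```r
# IncNodePurity values copied from the importance(rf.boston) output above
inc_node_purity <- c(crim = 1051.54, zn = 50.31, indus = 1017.64, chas = 56.32,
                     nox = 1107.31, rm = 5917.26, age = 552.27, dis = 1223.93,
                     rad = 84.30, tax = 435.71, ptratio = 817.33, black = 367.00,
                     lstat = 7713.63)

# Express each value as a percentage of the maximum, then rank the predictors
rel_importance <- 100 * inc_node_purity / max(inc_node_purity)
ranked <- sort(rel_importance, decreasing = TRUE)
top2 <- names(ranked)[1:2]       # the two most important variables
```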


8.3.4 Boosting 

Here we use the gbm package, and within it the gbm() function, to fit boosted 
regression trees to the Boston data set. We run gbm() with the option 
distribution="gaussian" since this is a regression problem; if it were a 
binary classification problem, we would use distribution="bernoulli". The 
argument n.trees=5000 indicates that we want 5000 trees, and the option 
interaction.depth=4 limits the depth of each tree. 

> library(gbm) 

> set.seed(1) 

> boost.boston=gbm(medv~.,data=Boston[train,],distribution= 
    "gaussian",n.trees=5000,interaction.depth=4) 

The summary() function produces a relative influence plot and also outputs 
the relative influence statistics. 

> summary(boost.boston) 
       var rel.inf 
1    lstat   45.96 
2       rm   31.22 
3      dis    6.81 
4     crim    4.07 
5      nox    2.56 
6  ptratio    2.27 
7    black    1.80 
8      age    1.64 
9      tax    1.36 
10   indus    1.27 
11    chas    0.80 
12     rad    0.20 
13      zn   0.015 

We see that lstat and rm are by far the most important variables. We can 
also produce partial dependence plots for these two variables. These plots 
illustrate the marginal effect of the selected variables on the response after 
integrating out the other variables. In this case, as we might expect, median 
house prices are increasing with rm and decreasing with lstat. 


> par(mfrow=c(1,2)) 

> plot(boost.boston,i="rm") 

> plot(boost.boston,i="lstat") 


We now use the boosted model to predict medv on the test set: 


> yhat.boost=predict(boost.boston,newdata=Boston[-train,], 
    n.trees=5000) 

> mean((yhat.boost-boston.test)^2) 

[1] 11.8 

The test MSE obtained is 11.8, similar to the test MSE for random forests 
and superior to that for bagging. If we want to, we can perform boosting 
with a different value of the shrinkage parameter λ in (8.10). The default 
value is 0.001, but this is easily modified. Here we take λ = 0.2. 

> boost.boston=gbm(medv~.,data=Boston[train,],distribution= 
    "gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2, 
    verbose=F) 

> yhat.boost=predict(boost.boston,newdata=Boston[-train,], 
    n.trees=5000) 

> mean((yhat.boost-boston.test)^2) 

[1] 11.5 


In this case, using λ = 0.2 leads to a slightly lower test MSE than λ = 0.001. 


8.4 Exercises 

Conceptual 

1. Draw an example (of your own invention) of a partition of two- 
dimensional feature space that could result from recursive binary 
splitting. Your example should contain at least six regions. Draw a 
decision tree corresponding to this partition. Be sure to label all aspects 
of your figures, including the regions $R_1, R_2, \ldots$, the cutpoints 
$t_1, t_2, \ldots$, and so forth. 

Hint: Your result should look something like Figures 8.1 and 8.2. 

2. It is mentioned in Section 8.2.3 that boosting using depth-one trees 
(or stumps) leads to an additive model: that is, a model of the form 

$$f(X) = \sum_{j=1}^{p} f_j(X_j).$$

Explain why this is the case. You can begin with (8.12) in 
Algorithm 8.2. 

3. Consider the Gini index, classification error, and cross-entropy in a 
simple classification setting with two classes. Create a single plot 
that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis 
should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should 
display the value of the Gini index, classification error, and entropy. 

Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make 
this plot by hand, but it will be much easier to make in R. 

4. This question relates to the plots in Figure 8.12. 

(a) Sketch the tree corresponding to the partition of the predictor 
space illustrated in the left-hand panel of Figure 8.12. The numbers 
inside the boxes indicate the mean of Y within each region. 

(b) Create a diagram similar to the left-hand panel of Figure 8.12, 
using the tree illustrated in the right-hand panel of the same 
figure. You should divide up the predictor space into the correct 
regions, and indicate the mean for each region. 

5. Suppose we produce ten bootstrapped samples from a data set 
containing red and green classes. We then apply a classification tree 
to each bootstrapped sample and, for a specific value of X , produce 
10 estimates of P(Class is Red|X): 


0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7, and 0.75. 

FIGURE 8.12. Left: A partition of the predictor space corresponding to Exercise 4a. 
Right: A tree corresponding to Exercise 4b. 

There are two common ways to combine these results together into a 
single class prediction. One is the majority vote approach discussed in 
this chapter. The second approach is to classify based on the average 
probability. In this example, what is the final classification under each 
of these two approaches? 

6. Provide a detailed explanation of the algorithm that is used to fit a 
regression tree. 





Applied 

7. In the lab, we applied random forests to the Boston data using mtry=6 
and using ntree=25 and ntree=500. Create a plot displaying the test 
error resulting from random forests on this data set for a more comprehensive 
range of values for mtry and ntree. You can model your 
plot after Figure 8.10. Describe the results obtained. 

8. In the lab, a classification tree was applied to the Carseats data set after 
converting Sales into a qualitative response variable. Now we will 
seek to predict Sales using regression trees and related approaches, 
treating the response as a quantitative variable. 

(a) Split the data set into a training set and a test set. 

(b) Fit a regression tree to the training set. Plot the tree, and interpret 
the results. What test MSE do you obtain? 

(c) Use cross-validation in order to determine the optimal level of 
tree complexity. Does pruning the tree improve the test MSE? 

(d) Use the bagging approach in order to analyze this data. What 
test MSE do you obtain? Use the importance() function to determine 
which variables are most important. 

(e) Use random forests to analyze this data. What test MSE do you 
obtain? Use the importance() function to determine which variables 
are most important. Describe the effect of m, the number of 
variables considered at each split, on the error rate 
obtained. 

9. This problem involves the OJ data set which is part of the ISLR 
package. 

(a) Create a training set containing a random sample of 800 observations, 
and a test set containing the remaining observations. 

(b) Fit a tree to the training data, with Purchase as the response 
and the other variables except for Buy as predictors. Use the 
summary() function to produce summary statistics about the 
tree, and describe the results obtained. What is the training 
error rate? How many terminal nodes does the tree have? 

(c) Type in the name of the tree object in order to get a detailed 
text output. Pick one of the terminal nodes, and interpret the 
information displayed. 

(d) Create a plot of the tree, and interpret the results. 

(e) Predict the response on the test data, and produce a confusion 
matrix comparing the test labels to the predicted test labels. 
What is the test error rate? 

(f) Apply the cv.tree() function to the training set in order to 
determine the optimal tree size. 

(g) Produce a plot with tree size on the x-axis and cross-validated 
classification error rate on the y-axis. 

(h) Which tree size corresponds to the lowest cross-validated 
classification error rate? 

(i) Produce a pruned tree corresponding to the optimal tree size 
obtained using cross-validation. If cross-validation does not lead 
to selection of a pruned tree, then create a pruned tree with five 
terminal nodes. 

(j) Compare the training error rates between the pruned and 
unpruned trees. Which is higher? 

(k) Compare the test error rates between the pruned and unpruned 
trees. Which is higher? 

10. We now use boosting to predict Salary in the Hitters data set. 

(a) Remove the observations for whom the salary information is 
unknown, and then log-transform the salaries. 

(b) Create a training set consisting of the first 200 observations, and 
a test set consisting of the remaining observations. 

(c) Perform boosting on the training set with 1,000 trees for a range 
of values of the shrinkage parameter λ. Produce a plot with 
different shrinkage values on the x-axis and the corresponding 
training set MSE on the y-axis. 

(d) Produce a plot with different shrinkage values on the x-axis and 
the corresponding test set MSE on the y-axis. 

(e) Compare the test MSE of boosting to the test MSE that results 
from applying two of the regression approaches seen in 
Chapters 3 and 6. 

(f) Which variables appear to be the most important predictors in 
the boosted model? 

(g) Now apply bagging to the training set. What is the test set MSE 
for this approach? 

11. This question uses the Caravan data set. 

(a) Create a training set consisting of the first 1,000 observations, 
and a test set consisting of the remaining observations. 

(b) Fit a boosting model to the training set with Purchase as the 
response and the other variables as predictors. Use 1,000 trees, 
and a shrinkage value of 0.01. Which predictors appear to be 
the most important? 

(c) Use the boosting model to predict the response on the test data. 
Predict that a person will make a purchase if the estimated 
probability of purchase is greater than 20%. Form a confusion 
matrix. What fraction of the people predicted to make a purchase 
do in fact make one? How does this compare with the results 
obtained from applying KNN or logistic regression to this data 
set? 

12. Apply boosting, bagging, and random forests to a data set of your 
choice. Be sure to fit the models on a training set and to evaluate their 
performance on a test set. How accurate are the results compared 
to simple methods like linear or logistic regression? Which of these 
approaches yields the best performance? 


9 Support Vector Machines 

In this chapter, we discuss the support vector machine (SVM), an approach 
for classification that was developed in the computer science community in 
the 1990s and that has grown in popularity since then. SVMs have been 
shown to perform well in a variety of settings, and are often considered one 
of the best “out of the box” classifiers. 

The support vector machine is a generalization of a simple and intuitive 
classifier called the maximal margin classifier, which we introduce in 
Section 9.1. Though it is elegant and simple, we will see that this classifier 
unfortunately cannot be applied to most data sets, since it requires that 
the classes be separable by a linear boundary. In Section 9.2, we introduce 
the support vector classifier, an extension of the maximal margin classifier 
that can be applied in a broader range of cases. Section 9.3 introduces the 
support vector machine, which is a further extension of the support vector 
classifier in order to accommodate non-linear class boundaries. Support 
vector machines are intended for the binary classification setting in which 
there are two classes; in Section 9.4 we discuss extensions of support vector 
machines to the case of more than two classes. In Section 9.5 we discuss 
the close connections between support vector machines and other statistical 
methods such as logistic regression. 

People often loosely refer to the maximal margin classifier, the support 
vector classifier, and the support vector machine as “support vector 
machines”. To avoid confusion, we will carefully distinguish between these 
three notions in this chapter. 


9.1 Maximal Margin Classifier 

In this section, we define a hyperplane and introduce the concept of an 
optimal separating hyperplane. 


9.1.1 What Is a Hyperplane? 

In a p-dimensional space, a hyperplane is a flat affine subspace of 
dimension $p - 1$.¹ For instance, in two dimensions, a hyperplane is a flat 
one-dimensional subspace, in other words, a line. In three dimensions, a 
hyperplane is a flat two-dimensional subspace, that is, a plane. In $p > 3$ 
dimensions, it can be hard to visualize a hyperplane, but the notion of a 
$(p - 1)$-dimensional flat subspace still applies. 

The mathematical definition of a hyperplane is quite simple. In two 
dimensions, a hyperplane is defined by the equation 

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \tag{9.1}$$

for parameters $\beta_0$, $\beta_1$, and $\beta_2$. When we say that (9.1) 
“defines” the hyperplane, we mean that any $X = (X_1, X_2)^T$ for which 
(9.1) holds is a point on the hyperplane. Note that (9.1) is simply the 
equation of a line, since indeed in two dimensions a hyperplane is a line. 

Equation 9.1 can be easily extended to the p-dimensional setting: 

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0 \tag{9.2}$$

defines a hyperplane in p-dimensional space, again in the sense that if a 
point $X = (X_1, X_2, \ldots, X_p)^T$ in p-dimensional space (i.e. a vector 
of length p) satisfies (9.2), then X lies on the hyperplane. 

Now, suppose that X does not satisfy (9.2); rather, 

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p > 0. \tag{9.3}$$

Then this tells us that X lies to one side of the hyperplane. On the other 
hand, if 

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p < 0, \tag{9.4}$$

then X lies on the other side of the hyperplane. So we can think of the 
hyperplane as dividing p-dimensional space into two halves. One can easily 
determine on which side of the hyperplane a point lies by simply calculating 
the sign of the left-hand side of (9.2). A hyperplane in two-dimensional 
space is shown in Figure 9.1. 
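
As a minimal sketch of this last point, the following Python snippet 
evaluates the left-hand side of (9.2) and reports its sign, using the 
hyperplane 1 + 2X1 + 3X2 = 0 of Figure 9.1. The function names and the three 
test points are our own illustrative choices and do not come from the text. 

```python
def hyperplane_value(beta0, beta, x):
    """Left-hand side of (9.2): beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def side(beta0, beta, x):
    """Return +1 or -1 for the two sides, and 0 if x lies on the hyperplane."""
    value = hyperplane_value(beta0, beta, x)
    return (value > 0) - (value < 0)


# The hyperplane of Figure 9.1: 1 + 2*X1 + 3*X2 = 0.
beta0, beta = 1.0, [2.0, 3.0]

# Illustrative points (assumed for this sketch): one on each side, one on the line.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
sides = [side(beta0, beta, x) for x in points]
```

Only the sign matters for deciding the side; the same function works for any 
p, since it simply pairs each coefficient with the matching coordinate. 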


1 The word affine indicates that the subspace need not pass through the origin. 


FIGURE 9.1. The hyperplane $1 + 2X_1 + 3X_2 = 0$ is shown. The blue region is 
the set of points for which $1 + 2X_1 + 3X_2 > 0$, and the purple region is 
the set of points for which $1 + 2X_1 + 3X_2 < 0$. 


9.1.2 Classification Using a Separating Hyperplane 

Now suppose that we have an $n \times p$ data matrix $\mathbf{X}$ that 
consists of n training observations in p-dimensional space, 

$$x_1 = \begin{pmatrix} x_{11} \\ \vdots \\ x_{1p} \end{pmatrix}, \; \ldots, \; x_n = \begin{pmatrix} x_{n1} \\ \vdots \\ x_{np} \end{pmatrix}, \tag{9.5}$$

and that these observations fall into two classes, that is, 
$y_1, \ldots, y_n \in \{-1, 1\}$, where $-1$ represents one class and $1$ the 
other class. We also have a test observation, a p-vector of observed features 
$x^* = (x_1^*, \ldots, x_p^*)^T$. Our goal is to develop a classifier based on 
the training data that will correctly classify the test observation using its 
feature measurements. We have seen a number of approaches for this task, such 
as linear discriminant analysis and logistic regression in Chapter 4, and 
classification trees, bagging, and boosting in Chapter 8. We will now see a 
new approach that is based upon the concept of a separating hyperplane. 

Suppose that it is possible to construct a hyperplane that separates the 
training observations perfectly according to their class labels. Examples 
of three such separating hyperplanes are shown in the left-hand panel of 
Figure 9.2. We can label the observations from the blue class as $y_i = 1$ 
and those from the purple class as $y_i = -1$. Then a separating hyperplane 
has the property that 

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \quad \text{if } y_i = 1, \tag{9.6}$$

and 

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \quad \text{if } y_i = -1. \tag{9.7}$$

Equivalently, a separating hyperplane has the property that 

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0 \tag{9.8}$$

for all $i = 1, \ldots, n$. 

FIGURE 9.2. Left: There are two classes of observations, shown in blue and 
in purple, each of which has measurements on two variables. Three separating 
hyperplanes, out of many possible, are shown in black. Right: A separating 
hyperplane is shown in black. The blue and purple grid indicates the decision 
rule made by a classifier based on this separating hyperplane: a test 
observation that falls in the blue portion of the grid will be assigned to the 
blue class, and a test observation that falls into the purple portion of the 
grid will be assigned to the purple class. 

If a separating hyperplane exists, we can use it to construct a very natural 
classifier: a test observation is assigned a class depending on which side of 
the hyperplane it is located. The right-hand panel of Figure 9.2 shows 
an example of such a classifier. That is, we classify the test observation 
$x^*$ based on the sign of 
$f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*$. 
If $f(x^*)$ is positive, then we assign the test observation to class 1, and 
if $f(x^*)$ is negative, then we assign it to class $-1$. We can also make use 
of the magnitude of $f(x^*)$. If $f(x^*)$ is far from zero, then this means 
that $x^*$ lies far from the hyperplane, and so we can be confident about our 
class assignment for $x^*$. On the other hand, if $f(x^*)$ is close to zero, 
then $x^*$ is located near the hyperplane, and so we are less certain about 
the class assignment for $x^*$. Not surprisingly, and as we see in Figure 9.2, 
a classifier that is based on a separating hyperplane leads to a linear 
decision boundary. 
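
The rule just described can be sketched in a few lines of Python. The 
training data, the coefficients, and the helper names below are hypothetical 
and chosen only for illustration: the sketch checks property (9.8) on the 
training set, then classifies two test points and reports $|f(x^*)|$ as a rough 
indication of how far each lies from the hyperplane. 

```python
def f(beta0, beta, x):
    """f(x) = beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def separates(beta0, beta, X, y):
    """Property (9.8): y_i * f(x_i) > 0 for every training observation."""
    return all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


def classify(beta0, beta, x_star):
    """Return (class label, |f(x*)|). A value of exactly zero lies on the
    hyperplane; this sketch arbitrarily assigns it to class -1."""
    value = f(beta0, beta, x_star)
    return (1 if value > 0 else -1), abs(value)


# Hypothetical training data and a hypothetical hyperplane X1 + X2 = 0.
X = [(2.0, 2.0), (3.0, 1.0), (-1.0, -2.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
beta0, beta = 0.0, [1.0, 1.0]

is_separating = separates(beta0, beta, X, y)
label_far, score_far = classify(beta0, beta, (4.0, 3.0))     # far from the boundary
label_near, score_near = classify(beta0, beta, (0.1, -0.2))  # close to the boundary
```

Note that $|f(x^*)|$ is proportional to the perpendicular distance only up to 
the scale of the coefficients, so magnitudes are comparable only for a fixed 
hyperplane. 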

9.1.3 The Maximal Margin Classifier 

In general, if our data can be perfectly separated using a hyperplane, then 
there will in fact exist an infinite number of such hyperplanes. This is 
because a given separating hyperplane can usually be shifted a tiny bit up or 
down, or rotated, without coming into contact with any of the observations. 
Three possible separating hyperplanes are shown in the left-hand panel 
of Figure 9.2. In order to construct a classifier based upon a separating 
hyperplane, we must have a reasonable way to decide which of the infinite 
possible separating hyperplanes to use. 

A natural choice is the maximal margin hyperplane (also known as the 
optimal separating hyperplane), which is the separating hyperplane that 
is farthest from the training observations. That is, we can compute the 
(perpendicular) distance from each training observation to a given 
separating hyperplane; the smallest such distance is the minimal distance 
from the observations to the hyperplane, and is known as the margin. The 
maximal margin hyperplane is the separating hyperplane for which the margin 
is largest, that is, it is the hyperplane that has the farthest minimum 
distance to the training observations. We can then classify a test observation 
based on which side of the maximal margin hyperplane it lies. This is known 
as the maximal margin classifier. We hope that a classifier that has a large 
margin on the training data will also have a large margin on the test data, 
and hence will classify the test observations correctly. Although the 
maximal margin classifier is often successful, it can also lead to overfitting 
when p is large. 
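
To make the definition of the margin concrete, the short Python sketch below 
computes the perpendicular distance from each observation to a hyperplane and 
takes the smallest one. It relies on the standard point-to-hyperplane distance 
formula $|f(x)| / \sqrt{\beta_1^2 + \cdots + \beta_p^2}$; the data and the two 
candidate hyperplanes are hypothetical. 

```python
import math


def distance(beta0, beta, x):
    """Perpendicular distance from x to the hyperplane beta0 + beta.x = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(beta0 + sum(b * xj for b, xj in zip(beta, x))) / norm


def margin(beta0, beta, X):
    """Smallest distance from any training observation to the hyperplane."""
    return min(distance(beta0, beta, x) for x in X)


# Hypothetical training observations: the first two belong to class 1,
# the last two to class -1.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]

# Two hypothetical hyperplanes, both of which separate the classes.
margin_a = margin(0.0, [1.0, 1.0], X)    # X1 + X2 = 0
margin_b = margin(-0.5, [1.0, 0.0], X)   # X1 = 0.5
```

Both hyperplanes separate the data, but the first has the larger margin and 
would therefore be preferred under the maximal margin criterion. 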

If $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients of the maximal 
margin hyperplane, then the maximal margin classifier classifies the test 
observation $x^*$ based on the sign of 
$f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*$. 

Figure 9.3 shows the maximal margin hyperplane on the data set of 
Figure 9.2. Comparing the right-hand panel of Figure 9.2 to Figure 9.3, 
we see that the maximal margin hyperplane shown in Figure 9.3 does indeed 
result in a greater minimal distance between the observations and the 
separating hyperplane, that is, a larger margin. In a sense, the maximal 
margin hyperplane represents the mid-line of the widest “slab” that we can 
insert between the two classes. 

Examining Figure 9.3, we see that three training observations are 
equidistant from the maximal margin hyperplane and lie along the dashed lines 
indicating the width of the margin. These three observations are known as 
support vectors, since they are vectors in p-dimensional space (in Figure 9.3, 
p = 2) and they “support” the maximal margin hyperplane in the sense 
that if these points were moved slightly then the maximal margin hyperplane 
would move as well. Interestingly, the maximal margin hyperplane 
depends directly on the support vectors, but not on the other observations: 
a movement to any of the other observations would not affect the separating 
hyperplane, provided that the observation’s movement does not cause it to 
cross the boundary set by the margin. The fact that the maximal margin 
hyperplane depends directly on only a small subset of the observations is 
an important property that will arise later in this chapter when we discuss 
the support vector classifier and support vector machines. 

FIGURE 9.3. There are two classes of observations, shown in blue and in 
purple. The maximal margin hyperplane is shown as a solid line. The margin is 
the distance from the solid line to either of the dashed lines. The two blue 
points and the purple point that lie on the dashed lines are the support 
vectors, and the distance from those points to the hyperplane is indicated by 
arrows. The purple and blue grid indicates the decision rule made by a 
classifier based on this separating hyperplane. 


9.1.4 Construction of the Maximal Margin Classifier 

We now consider the task of constructing the maximal margin hyperplane 
based on a set of n training observations $x_1, \ldots, x_n \in \mathbb{R}^p$ 
and associated class labels $y_1, \ldots, y_n \in \{-1, 1\}$. Briefly, the 
maximal margin hyperplane is the solution to the optimization problem 

$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{maximize}} \; M \tag{9.9}$$

$$\text{subject to } \sum_{j=1}^{p} \beta_j^2 = 1, \tag{9.10}$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \forall \, i = 1, \ldots, n. \tag{9.11}$$

This optimization problem (9.9)-(9.11) is actually simpler than it looks. 
First of all, the constraint in (9.11) that 

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \forall \, i = 1, \ldots, n$$

guarantees that each observation will be on the correct side of the 
hyperplane, provided that M is positive. (Actually, for each observation to be 
on the correct side of the hyperplane we would simply need 
$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0$, 
so the constraint in (9.11) in fact requires that each observation be on the 
correct side of the hyperplane, with some cushion, provided that M is 
positive.) 

Second, note that (9.10) is not really a constraint on the hyperplane, since 
if $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = 0$ 
defines a hyperplane, then so does 
$k(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) = 0$ 
for any $k \ne 0$. However, (9.10) adds meaning to (9.11); one can show that 
with this constraint the perpendicular distance from the ith observation to 
the hyperplane is given by 

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}).$$

(In general this distance equals the absolute value of the expression in 
parentheses divided by $\sqrt{\sum_{j=1}^{p} \beta_j^2}$; under (9.10) the 
denominator is 1, and for an observation on the correct side the absolute 
value is obtained by multiplying by $y_i$.) Therefore, the constraints (9.10) 
and (9.11) ensure that each observation is on the correct side of the 
hyperplane and at least a distance M from the hyperplane. Hence, M represents 
the margin of our hyperplane, and the optimization problem chooses 
$\beta_0, \beta_1, \ldots, \beta_p$ to maximize M. This is exactly 
the definition of the maximal margin hyperplane! The problem (9.9)-(9.11) 
can be solved efficiently, but details of this optimization are outside of the 
scope of this book. 
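
Although the efficient solution is beyond our scope, the structure of 
(9.9)-(9.11) can be illustrated for p = 2 with a crude brute-force search. 
The Python sketch below is not the algorithm used in practice; it is a toy 
grid search under our own assumptions. It writes the coefficients as 
$(\beta_1, \beta_2) = (\cos\theta, \sin\theta)$, so that (9.10) holds 
automatically, places the intercept midway between the two classes for each 
direction, and keeps the direction with the largest M. The data set and all 
function names are hypothetical. 

```python
import math


def best_margin_for_direction(theta, X, y):
    """For the unit normal (cos theta, sin theta), return (M, beta0) with the
    intercept chosen midway between the two classes along that direction."""
    b1, b2 = math.cos(theta), math.sin(theta)
    proj_pos = [b1 * x1 + b2 * x2 for (x1, x2), yi in zip(X, y) if yi == 1]
    proj_neg = [b1 * x1 + b2 * x2 for (x1, x2), yi in zip(X, y) if yi == -1]
    lo, hi = max(proj_neg), min(proj_pos)
    # M is negative when this direction fails to separate the classes.
    return (hi - lo) / 2.0, -(hi + lo) / 2.0


def maximal_margin_grid(X, y, steps=3600):
    """Grid search over directions; returns (M, beta0, beta1, beta2)."""
    best = None
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        M, beta0 = best_margin_for_direction(theta, X, y)
        if best is None or M > best[0]:
            best = (M, beta0, math.cos(theta), math.sin(theta))
    return best


# Hypothetical separable data in two dimensions.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]

M, beta0, b1, b2 = maximal_margin_grid(X, y)

# Left-hand side of (9.11) for every observation; each should be at least M.
cushions = [yi * (beta0 + b1 * x1 + b2 * x2) for (x1, x2), yi in zip(X, y)]
```

For this toy data set the search recovers a hyperplane close to X1 + X2 = 0, 
with a margin of roughly 1.41; the grid resolution limits the accuracy in 
general, which is one reason a proper optimizer is used in practice. 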

9.1.5 The Non-separable Case 

The maximal margin classifier is a very natural way to perform 
classification, if a separating hyperplane exists. However, as we have hinted, 
in many cases no separating hyperplane exists, and so there is no maximal 
margin classifier. In this case, the optimization problem (9.9)-(9.11) has no 
solution with M > 0. An example is shown in Figure 9.4. In this case, we 
cannot exactly separate the two classes. However, as we will see in the next 
section, we can extend the concept of a separating hyperplane in order to 
develop a hyperplane that almost separates the classes, using a so-called 
soft margin. The generalization of the maximal margin classifier to the 
non-separable case is known as the support vector classifier. 

FIGURE 9.4. There are two classes of observations, shown in blue and in 
purple. In this case, the two classes are not separable by a hyperplane, and 
so the maximal margin classifier cannot be used. 


9.2 Support Vector Classifiers 

9.2.1 Overview of the Support Vector Classifier 

In Figure 9.4, we see that observations that belong to two classes are not 
necessarily separable by a hyperplane. In fact, even if a separating 
hyperplane does exist, then there are instances in which a classifier based on 
a separating hyperplane might not be desirable. A classifier based on a 
separating hyperplane will necessarily perfectly classify all of the training 
observations; this can lead to sensitivity to individual observations. An 
example is shown in Figure 9.5. The addition of a single observation in the 
right-hand panel of Figure 9.5 leads to a dramatic change in the maximal 
margin hyperplane. The resulting maximal margin hyperplane is not 
satisfactory: for one thing, it has only a tiny margin. This is problematic 
because as discussed previously, the distance of an observation from the 
hyperplane can be seen as a measure of our confidence that the observation 
was correctly classified. Moreover, the fact that the maximal margin 
hyperplane is extremely sensitive to a change in a single observation 
suggests that it may have overfit the training data. 

In this case, we might be willing to consider a classifier based on a 
hyperplane that does not perfectly separate the two classes, in the interest 
of 

• Greater robustness to individual observations, and 

• Better classification of most of the training observations. 

That is, it could be worthwhile to misclassify a few training observations 
in order to do a better job in classifying the remaining observations. 

FIGURE 9.5. Left: Two classes of observations are shown in blue and in 
purple, along with the maximal margin hyperplane. Right: An additional blue 
observation has been added, leading to a dramatic shift in the maximal margin 
hyperplane shown as a solid line. The dashed line indicates the maximal margin 
hyperplane that was obtained in the absence of this additional point. 

The support vector classifier, sometimes called a soft margin classifier, 
does exactly this. Rather than seeking the largest possible margin so that 
every observation is not only on the correct side of the hyperplane but 
also on the correct side of the margin, we instead allow some observations 
to be on the incorrect side of the margin, or even the incorrect side of 
the hyperplane. (The margin is soft because it can be violated by some 
of the training observations.) An example is shown in the left-hand panel 
of Figure 9.6. Most of the observations are on the correct side of the margin. 
However, a small subset of the observations are on the wrong side of the 
margin. 

An observation can be not only on the wrong side of the margin, but also 
on the wrong side of the hyperplane. In fact, when there is no separating 
hyperplane, such a situation is inevitable. Observations on the wrong side of 
the hyperplane correspond to training observations that are misclassified by 
the support vector classifier. The right-hand panel of Figure 9.6 illustrates 
such a scenario. 




9.2.2 Details of the Support Vector Classifier 

The support vector classifier classifies a test observation depending on 
which side of a hyperplane it lies. The hyperplane is chosen to correctly 
separate most of the training observations into the two classes, but may 
misclassify a few observations. It is the solution to the optimization problem 

$$\underset{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n}{\text{maximize}} \; M \tag{9.12}$$

$$\text{subject to } \sum_{j=1}^{p} \beta_j^2 = 1, \tag{9.13}$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \tag{9.14}$$

$$\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C, \tag{9.15}$$

FIGURE 9.6. Left: A support vector classifier was fit to a small data set. The 
hyperplane is shown as a solid line and the margins are shown as dashed lines. 
Purple observations: Observations 3, 4, 5, and 6 are on the correct side of the 
margin, observation 2 is on the margin, and observation 1 is on the wrong side 
of the margin. Blue observations: Observations 7 and 10 are on the correct side 
of the margin, observation 9 is on the margin, and observation 8 is on the 
wrong side of the margin. No observations are on the wrong side of the 
hyperplane. Right: Same as left panel with two additional points, 11 and 12. 
These two observations are on the wrong side of the hyperplane and the wrong 
side of the margin. 

where C is a nonnegative tuning parameter. As in (9.11), M is the width 
of the margin; we seek to make this quantity as large as possible. In (9.14), 
$\epsilon_1, \ldots, \epsilon_n$ are slack variables that allow individual 
observations to be on the wrong side of the margin or the hyperplane; we will 
explain them in greater detail momentarily. Once we have solved 
(9.12)-(9.15), we classify a test observation $x^*$ as before, by simply 
determining on which side of the hyperplane it lies. That is, we classify the 
test observation based on the sign of 
$f(x^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*$. 

The problem (9.12)-(9.15) seems complex, but insight into its behavior 
can be gained through a series of simple observations presented below. First 
of all, the slack variable $\epsilon_i$ tells us where the ith observation is 
located, relative to the hyperplane and relative to the margin. If 
$\epsilon_i = 0$ then the ith observation is on the correct side of the 
margin, as we saw in Section 9.1.4. If $\epsilon_i > 0$ then the ith 
observation is on the wrong side of the margin, and we say that the ith 
observation has violated the margin. If $\epsilon_i > 1$ then it 
is on the wrong side of the hyperplane. 

We now consider the role of the tuning parameter C. In (9.15), C bounds 
the sum of the $\epsilon_i$’s, and so it determines the number and severity of 
the violations to the margin (and to the hyperplane) that we will tolerate. We 
can think of C as a budget for the amount that the margin can be violated 
by the n observations. If C = 0 then there is no budget for violations to 
the margin, and it must be the case that 
$\epsilon_1 = \cdots = \epsilon_n = 0$, in which case 
(9.12)-(9.15) simply amounts to the maximal margin hyperplane optimization 
problem (9.9)-(9.11). (Of course, a maximal margin hyperplane exists 
only if the two classes are separable.) For C > 0 no more than C 
observations can be on the wrong side of the hyperplane, because if an 
observation is on the wrong side of the hyperplane then $\epsilon_i > 1$, and 
(9.15) requires that $\sum_{i=1}^{n} \epsilon_i \le C$. As the budget C 
increases, we become more tolerant of violations to the margin, and so the 
margin will widen. Conversely, as C decreases, we become less tolerant of 
violations to the margin and so the margin narrows. An example is shown in 
Figure 9.7. 
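
The roles of the slack variables and of the budget C can be illustrated with 
a small Python sketch. For a fixed hyperplane and margin M, the smallest slack 
that satisfies (9.14) for observation i is $\max(0, 1 - y_i f(x_i)/M)$; the 
sketch computes these values, labels each observation accordingly, and checks 
the budget constraint in (9.15). The hyperplane, margin, data, and function 
names are hypothetical and are not the output of an actual fit. 

```python
def slack(beta0, beta, M, x, y):
    """Smallest eps >= 0 satisfying (9.14): y * f(x) >= M * (1 - eps)."""
    fx = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return max(0.0, 1.0 - y * fx / M)


def describe(eps):
    """Translate a slack value into the position of the observation."""
    if eps == 0:
        return "correct side of margin"
    if eps > 1:
        return "wrong side of hyperplane"
    return "violates margin"


def within_budget(eps_values, C):
    """Budget constraint from (9.15): the slacks must sum to at most C."""
    return sum(eps_values) <= C


# Hypothetical hyperplane X1 = 0 (so sum of beta_j^2 is 1) with margin M = 1.
beta0, beta, M = 0.0, [1.0, 0.0], 1.0
X = [(2.0, 0.0), (0.5, 1.0), (-0.5, 0.0), (-3.0, 1.0)]
y = [1, 1, 1, -1]

eps = [slack(beta0, beta, M, xi, yi) for xi, yi in zip(X, y)]
labels = [describe(e) for e in eps]
n_wrong_side = sum(e > 1 for e in eps)
```

Here the slacks sum to 2, so this hyperplane and margin would be feasible for 
a budget of C = 2 but not for C = 1, and the single observation on the wrong 
side of the hyperplane is consistent with the bound of at most C such 
observations. 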

In practice, C is treated as a tuning parameter that is generally chosen via 
cross-validation. As with the tuning parameters that we have seen 
throughout this book, C controls the bias-variance trade-off of the 
statistical learning technique. When C is small, we seek narrow margins that 
are rarely violated; this amounts to a classifier that is highly fit to the 
data, which may have low bias but high variance. On the other hand, when C is 
larger, the margin is wider and we allow more violations to it; this amounts 
to fitting the data less closely and obtaining a classifier that is 
potentially more biased but may have lower variance. 

The optimization problem (9.12)-(9.15) has a very interesting property: 
it turns out that only observations that either lie on the margin or that 
violate the margin will affect the hyperplane, and hence the classifier 
obtained. In other words, an observation that lies strictly on the correct 
side of the margin does not affect the support vector classifier! Changing the 
position of that observation would not change the classifier at all, provided 
that its position remains on the correct side of the margin. Observations 
that lie directly on the margin, or on the wrong side of the margin for 
their class, are known as support vectors. These observations do affect the 
support vector classifier. 

The fact that only support vectors affect the classifier is in line with our 
previous assertion that C controls the bias-variance trade-off of the support 
vector classifier. When the tuning parameter C is large, then the margin is 
wide, many observations violate the margin, and so there are many support 
vectors. In this case, many observations are involved in determining the 
hyperplane. The top left panel in Figure 9.7 illustrates this setting: this 
classifier has low variance (since many observations are support vectors) 
but potentially high bias. In contrast, if C is small, then there will be fewer 
support vectors and hence the resulting classifier will have low bias but 
high variance. The bottom right panel in Figure 9.7 illustrates this setting, 
with only eight support vectors. 

FIGURE 9.7. A support vector classifier was fit using four different values of 
the tuning parameter C in (9.12)-(9.15). The largest value of C was used in the 
top left panel, and smaller values were used in the top right, bottom left, and 
bottom right panels. When C is large, then there is a high tolerance for 
observations being on the wrong side of the margin, and so the margin will be 
large. As C decreases, the tolerance for observations being on the wrong side 
of the margin decreases, and the margin narrows. 

The fact that the support vector classifier’s decision rule is based only 
on a potentially small subset of the training observations (the support 
vectors) means that it is quite robust to the behavior of observations that 
are far away from the hyperplane. This property is distinct from some of 
the other classification methods that we have seen in preceding chapters, 
such as linear discriminant analysis. Recall that the LDA classification rule 
depends on the mean of all of the observations within each class, as well as 
the within-class covariance matrix computed using all of the observations. 
In contrast, logistic regression, unlike LDA, has very low sensitivity to 
observations far from the decision boundary. In fact we will see in 
Section 9.5 that the support vector classifier and logistic regression are 
closely related. 

FIGURE 9.8. Left: The observations fall into two classes, with a non-linear 
boundary between them. Right: The support vector classifier seeks a linear 
boundary, and consequently performs very poorly. 


9.3 Support Vector Machines 

We first discuss a general mechanism for converting a linear classifier into 
one that produces non-linear decision boundaries. We then introduce the 
support vector machine, which does this in an automatic way. 


9.3.1 Classification with Non-linear Decision Boundaries 

The support vector classifier is a natural approach for classification in the 
two-class setting, if the boundary between the two classes is linear. 
However, in practice we are sometimes faced with non-linear class boundaries. 
For instance, consider the data in the left-hand panel of Figure 9.8. It is 
clear that a support vector classifier or any linear classifier will perform 
poorly here. Indeed, the support vector classifier shown in the right-hand 
panel of Figure 9.8 is useless here. 

In Chapter 7, we are faced with an analogous situation. We see there 
that the performance of linear regression can suffer when there is a 
non-linear relationship between the predictors and the outcome. In that case, 
we consider enlarging the feature space using functions of the predictors, 
such as quadratic and cubic terms, in order to address this non-linearity. 
In the case of the support vector classifier, we could address the 
problem of possibly non-linear boundaries between classes in a similar way, by 
enlarging the feature space using quadratic, cubic, and even higher-order 
polynomial functions of the predictors. For instance, rather than fitting a 
support vector classifier using p features 

$$X_1, X_2, \ldots, X_p,$$

we could instead fit a support vector classifier using 2p features 

$$X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2.$$

Then (9.12)-(9.15) would become 

$$\underset{\beta_0, \beta_{11}, \beta_{12}, \ldots, \beta_{p1}, \beta_{p2}, \epsilon_1, \ldots, \epsilon_n}{\text{maximize}} \; M \tag{9.16}$$

$$\text{subject to } y_i \left( \beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2 \right) \ge M(1 - \epsilon_i),$$

$$\sum_{i=1}^{n} \epsilon_i \le C, \quad \epsilon_i \ge 0, \quad \sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1.$$

Why does this lead to a non-linear decision boundary? In the enlarged 
feature space, the decision boundary that results from (9.16) is in fact 
linear. But in the original feature space, the decision boundary is of the 
form $q(x) = 0$, where q is a quadratic polynomial, and its solutions are 
generally non-linear. One might additionally want to enlarge the feature space 
with higher-order polynomial terms, or with interaction terms of the form 
$X_j X_{j'}$ for $j \ne j'$. Alternatively, other functions of the predictors 
could be considered rather than polynomials. It is not hard to see that there 
are many possible ways to enlarge the feature space, and that unless we 
are careful, we could end up with a huge number of features. Then 
computations would become unmanageable. The support vector machine, which 
we present next, allows us to enlarge the feature space used by the support 
vector classifier in a way that leads to efficient computations. 
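
A one-dimensional toy example, sketched in Python below, shows how enlarging 
the feature space can turn a non-linear problem into a linear one. The data, 
the coefficients, and the helper names are our own assumptions. With a single 
feature X, no threshold separates the two classes, but after adding $X^2$ the 
hyperplane $X^2 - 2.25 = 0$ in the enlarged space separates them perfectly. 

```python
def expand(x):
    """Map p features (x_1, ..., x_p) to the 2p features
    (x_1, x_1^2, ..., x_p, x_p^2)."""
    out = []
    for xj in x:
        out.extend([xj, xj * xj])
    return out


def separates(beta0, beta, X, y):
    """True if y_i * f(x_i) > 0 for every observation."""
    return all(yi * (beta0 + sum(b * v for b, v in zip(beta, xi))) > 0
               for xi, yi in zip(X, y))


# Hypothetical data with p = 1: class 1 lies on both sides of class -1.
X = [(-3.0,), (-2.0,), (-0.5,), (0.0,), (1.0,), (2.5,)]
y = [1, 1, -1, -1, -1, 1]

# In the original space a linear rule is s * (x - t): try every threshold t
# between neighbouring points and both orientations s.
xs = sorted(x[0] for x in X)
thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
linear_ok = any(separates(-s * t, [s], X, y)
                for t in thresholds for s in (1.0, -1.0))

# In the enlarged space (X, X^2) the hyperplane -2.25 + 0*X + 1*X^2 = 0 works.
X_enlarged = [expand(x) for x in X]
quadratic_ok = separates(-2.25, [0.0, 1.0], X_enlarged, y)
```

The boundary is linear in the enlarged features but corresponds to the two 
points X = -1.5 and X = 1.5 in the original space, which no single threshold 
can reproduce. 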

9.3.2 The Support Vector Machine 

The support vector machine (SVM) is an extension of the support vector 
classifier that results from enlarging the feature space in a specific way, 
using kernels. We will now discuss this extension, the details of which are 
somewhat complex and beyond the scope of this book. However, the main 
idea is described in Section 9.3.1: we may want to enlarge our feature space 
in order to accommodate a non-linear boundary between the classes. The 
kernel approach that we describe here is simply an efficient computational 
approach for enacting this idea. 

We have not discussed exactly how the support vector classifier is 
computed because the details become somewhat technical. However, it turns 
out that the solution to the support vector classifier problem (9.12)-(9.15) 
involves only the inner products of the observations (as opposed to the 
observations themselves). The inner product of two r-vectors a and b is 
defined as $\langle a, b \rangle = \sum_{i=1}^{r} a_i b_i$. Thus the inner 
product of two observations $x_i$, $x_{i'}$ is given by 

$$\langle x_i, x_{i'} \rangle = \sum_{j=1}^{p} x_{ij} x_{i'j}. \tag{9.17}$$

It can be shown that 

• The linear support vector classifier can be represented as 

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle, \tag{9.18}$$

where there are n parameters $\alpha_i$, $i = 1, \ldots, n$, one per training 
observation. 

• To estimate the parameters $\alpha_1, \ldots, \alpha_n$ and $\beta_0$, all 
we need are the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ 
between all pairs of training observations. (The notation $\binom{n}{2}$ 
means $n(n - 1)/2$, and gives the number of pairs among a set of n items.) 

Notice that in (9.18), in order to evaluate the function $f(x)$, we need to 
compute the inner product between the new point x and each of the training 
points $x_i$. However, it turns out that $\alpha_i$ is nonzero only for the 
support vectors in the solution, that is, if a training observation is not a 
support vector, then its $\alpha_i$ equals zero. So if $\mathcal{S}$ is the 
collection of indices of these support points, we can rewrite any solution 
function of the form (9.18) as 

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle, \tag{9.19}$$

which typically involves far fewer terms than in (9.18).² 

To summarize, in representing the linear classifier f(x), and in computing 
its coefficients, all we need are inner products. 
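
As a small illustration of the representations (9.18) and (9.19), the following R sketch evaluates f(x) in three equivalent ways. The values of alpha and beta0 are made up for the example; they are not the solution of any fitted support vector classifier, and the function names are our own.

```r
# Illustrative sketch only: alpha and beta0 are arbitrary made-up values,
# not the solution of an actual support vector classifier.
set.seed(1)
n <- 6; p <- 2
X <- matrix(rnorm(n * p), ncol = p)     # training observations, one per row
alpha <- c(0.5, 0, -0.3, 0, 0, 0.8)     # nonzero only for the "support vectors"
beta0 <- 0.1

# f(x) evaluated through inner products with all training points, as in (9.18)
f_inner <- function(x) beta0 + sum(alpha * (X %*% x))

# The same sum restricted to indices with nonzero alpha, as in (9.19)
S <- which(alpha != 0)
f_support <- function(x) beta0 + sum(alpha[S] * (X[S, , drop = FALSE] %*% x))

# Expanding the inner products gives ordinary linear coefficients:
# beta_j = sum_i alpha_i * x_ij
beta <- as.vector(t(X) %*% alpha)
f_linear <- function(x) beta0 + sum(beta * x)

x_new <- c(0.4, -1.2)
c(f_inner(x_new), f_support(x_new), f_linear(x_new))
```

All three values agree, which is the sense in which only inner products are needed to represent the linear classifier.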

² By expanding each of the inner products in (9.19), it is easy to see that f(x) is
a linear function of the coordinates of x. Doing so also establishes the correspondence
between the $\alpha_i$ and the original parameters $\beta_j$.

Now suppose that every time the inner product (9.17) appears in the
representation (9.18), or in a calculation of the solution for the support
vector classifier, we replace it with a generalization of the inner product of
the form

$$K(x_i, x_{i'}), \tag{9.20}$$

where K is some function that we will refer to as a kernel. A kernel is a 
function that quantifies the similarity of two observations. For instance, we 
could simply take

$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}, \tag{9.21}$$

which would just give us back the support vector classifier. Equation 9.21 
is known as a linear kernel because the support vector classifier is linear 
in the features; the linear kernel essentially quantifies the similarity of a 
pair of observations using Pearson (standard) correlation. But one could 
instead choose another form for (9.20). For instance, one could replace
every instance of $\sum_{j=1}^{p} x_{ij} x_{i'j}$ with the quantity

$$K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^{d}. \tag{9.22}$$

This is known as a polynomial kernel of degree d , where d is a positive 
integer. Using such a kernel with d > 1, instead of the standard linear 
kernel (9.21), in the support vector classifier algorithm leads to a much more 
flexible decision boundary. It essentially amounts to fitting a support vector 
classifier in a higher-dimensional space involving polynomials of degree d, 
rather than in the original feature space. When the support vector classifier 
is combined with a non-linear kernel such as (9.22), the resulting classifier is 
known as a support vector machine. Note that in this case the (non-linear) 
function has the form 


$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i). \tag{9.23}$$

The left-hand panel of Figure 9.9 shows an example of an SVM with a 
polynomial kernel applied to the non-linear data from Figure 9.8. The fit is 
a substantial improvement over the linear support vector classifier. When 
d = 1, then the SVM reduces to the support vector classifier seen earlier in 
this chapter. 
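
To make the connection between the polynomial kernel (9.22) and an enlarged feature space concrete, here is a minimal R sketch for p = 2 and d = 2. The helper names poly_kernel and phi are our own, and phi is one possible explicit feature map for this special case.

```r
# Polynomial kernel (9.22) for two observations x and z
poly_kernel <- function(x, z, d) (1 + sum(x * z))^d

# Explicit feature map for p = 2 and d = 2 (illustrative helper):
# (1 + <x, z>)^2 equals the ordinary inner product of phi(x) and phi(z)
phi <- function(x) {
  c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
}

x <- c(1, 2)
z <- c(3, -1)
k_direct   <- poly_kernel(x, z, d = 2)   # computed in the original 2-d space
k_expanded <- sum(phi(x) * phi(z))       # computed in the enlarged 6-d space
c(k_direct, k_expanded)
```

Both computations give the same number, but the kernel never forms the six-dimensional vectors.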

The polynomial kernel shown in (9.22) is one example of a possible 
non-linear kernel, but alternatives abound. Another popular choice is the 
radial kernel, which takes the form

$$K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr). \tag{9.24}$$

FIGURE 9.9. Left: An SVM with a polynomial kernel of degree 3 is applied to 
the non-linear data from Figure 9.8, resulting in a far more appropriate decision 
rule. Right: An SVM with a radial kernel is applied. In this example, either kernel 
is capable of capturing the decision boundary. 


In (9.24), $\gamma$ is a positive constant. The right-hand panel of Figure 9.9 shows
an example of an SVM with a radial kernel on this non-linear data; it also
does a good job in separating the two classes.

How does the radial kernel (9.24) actually work? If a given test observation
$x^* = (x_1^*, \ldots, x_p^*)^T$ is far from a training observation $x_i$ in terms of
Euclidean distance, then $\sum_{j=1}^{p} (x_j^* - x_{ij})^2$ will be large, and so
$K(x^*, x_i) = \exp\bigl(-\gamma \sum_{j=1}^{p} (x_j^* - x_{ij})^2\bigr)$ will be very tiny. This means that in (9.23), $x_i$
will play virtually no role in $f(x^*)$. Recall that the predicted class label
for the test observation $x^*$ is based on the sign of $f(x^*)$. In other words,
training observations that are far from $x^*$ will play essentially no role in
the predicted class label for $x^*$. This means that the radial kernel has very
local behavior, in the sense that only nearby training observations have an
effect on the class label of a test observation.
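
The local behavior of the radial kernel can be seen numerically. In this short R sketch the points and the value gamma = 1 are arbitrary choices for illustration, and radial_kernel is our own helper implementing (9.24).

```r
# Radial kernel (9.24) for two observations x and z
radial_kernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))

x_star <- c(0, 0)      # test observation
x_near <- c(0.1, 0.1)  # training observation close to x_star
x_far  <- c(3, 4)      # training observation at Euclidean distance 5

k_near <- radial_kernel(x_star, x_near, gamma = 1)  # exp(-0.02), close to 1
k_far  <- radial_kernel(x_star, x_far,  gamma = 1)  # exp(-25), essentially 0
c(near = k_near, far = k_far)
```

The distant training observation receives a kernel value that is negligible next to the nearby one, so it contributes almost nothing to the sum in (9.23).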

What is the advantage of using a kernel rather than simply enlarging
the feature space using functions of the original features, as in (9.16)? One
advantage is computational, and it amounts to the fact that using kernels,
one need only compute $K(x_i, x_{i'})$ for all $\binom{n}{2}$ distinct pairs i, i'. This can be
done without explicitly working in the enlarged feature space. This is important
because in many applications of SVMs, the enlarged feature space
is so large that computations are intractable. For some kernels, such as the
radial kernel (9.24), the feature space is implicit and infinite-dimensional,
so we could never do the computations there anyway!
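
A rough count illustrates the computational point. The sketch below compares the number of polynomial features of degree at most d in p variables with the number of distinct pairs of n observations; the particular values of p, d, and n are assumptions chosen only for illustration.

```r
p <- 2308   # number of original features (assumed for illustration)
d <- 3      # polynomial degree (assumed)
n <- 63     # number of training observations (assumed)

# Number of non-constant monomials of degree at most d in p variables
n_expanded <- choose(p + d, d) - 1

# Number of kernel evaluations between distinct pairs of observations
n_pairs <- choose(n, 2)

c(expanded_features = n_expanded, kernel_pairs = n_pairs)
```

With these values the enlarged space has on the order of billions of coordinates, while the kernel approach needs only a few thousand pairwise evaluations.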

FIGURE 9.10. ROC curves for the Heart data training set. Left: The support
vector classifier and LDA are compared. Right: The support vector classifier is
compared to an SVM using a radial basis kernel with $\gamma = 10^{-3}$, $10^{-2}$, and $10^{-1}$.


9.3.3 An Application to the Heart Disease Data 

In Chapter 8 we apply decision trees and related methods to the Heart data. 
The aim is to use 13 predictors such as Age, Sex, and Chol in order to predict
whether an individual has heart disease. We now investigate how an SVM 
compares to LDA on this data. After removing 6 missing observations, the 
data consist of 297 subjects, which we randomly split into 207 training and 
90 test observations. 

We first fit LDA and the support vector classifier to the training data.
Note that the support vector classifier is equivalent to an SVM using a polynomial
kernel of degree d = 1. The left-hand panel of Figure 9.10 displays
ROC curves (described in Section 4.4.3) for the training set predictions for
both LDA and the support vector classifier. Both classifiers compute scores
of the form $\hat{f}(X) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_p X_p$ for each observation.
For any given cutoff t, we classify observations into the heart disease or
no heart disease categories depending on whether $\hat{f}(X) < t$ or $\hat{f}(X) \geq t$.
The ROC curve is obtained by forming these predictions and computing
the false positive and true positive rates for a range of values of t. An optimal
classifier will hug the top left corner of the ROC plot. In this instance
LDA and the support vector classifier both perform well, though there is a
suggestion that the support vector classifier may be slightly superior.
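
The construction of an ROC curve from scores can be sketched in a few lines of base R. The scores and labels below are hypothetical, and we assume that scores at or above the cutoff are assigned to the +1 class; neither comes from the Heart data.

```r
# Hypothetical scores and true labels (+1 = disease, -1 = no disease)
score <- c(-2.1, -1.3, -0.4, 0.2, 0.6, 1.5, 2.2, -0.8)
truth <- c(-1,   -1,    1,  -1,   1,   1,   1,  -1)

# False positive and true positive rates at a single cutoff t
roc_point <- function(t) {
  pred <- ifelse(score >= t, 1, -1)
  tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  fpr <- sum(pred == 1 & truth == -1) / sum(truth == -1)
  c(cutoff = t, fpr = fpr, tpr = tpr)
}

# Sweep the cutoff from very strict to very lenient
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)
roc <- t(sapply(cutoffs, roc_point))
roc
```

Plotting the tpr column against the fpr column traces out the ROC curve, which runs from (0, 0) at the strictest cutoff to (1, 1) at the most lenient.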

The right-hand panel of Figure 9.10 displays ROC curves for SVMs using
a radial kernel, with various values of $\gamma$. As $\gamma$ increases and the fit becomes
more non-linear, the ROC curves improve. Using $\gamma = 10^{-1}$ appears to give
an almost perfect ROC curve. However, these curves represent training
error rates, which can be misleading in terms of performance on new test
data. Figure 9.11 displays ROC curves computed on the 90 test observations.
We observe some differences from the training ROC curves. In the
left-hand panel of Figure 9.11, the support vector classifier appears to have
a small advantage over LDA (although these differences are not statistically
significant). In the right-hand panel, the SVM using $\gamma = 10^{-1}$, which
showed the best results on the training data, produces the worst estimates
on the test data. This is once again evidence that while a more flexible
method will often produce lower training error rates, this does not necessarily
lead to improved performance on test data. The SVMs with $\gamma = 10^{-2}$
and $\gamma = 10^{-3}$ perform comparably to the support vector classifier, and all
three outperform the SVM with $\gamma = 10^{-1}$.

FIGURE 9.11. ROC curves for the test set of the Heart data. Left: The support
vector classifier and LDA are compared. Right: The support vector classifier is
compared to an SVM using a radial basis kernel with $\gamma = 10^{-3}$, $10^{-2}$, and $10^{-1}$.


9.4 SVMs with More than Two Classes 

So far, our discussion has been limited to the case of binary classification: 
that is, classification in the two-class setting. How can we extend SVMs 
to the more general case where we have some arbitrary number of classes? 

It turns out that the concept of separating hyperplanes upon which SVMs 
are based does not lend itself naturally to more than two classes. Though 
a number of proposals for extending SVMs to the K-class case have been
made, the two most popular are the one-versus-one and one-versus-all 
approaches. We briefly discuss those two approaches here. 

9.4.1 One-Versus-One Classification

Suppose that we would like to perform classification using SVMs, and there
are K > 2 classes. A one-versus-one or all-pairs approach constructs $\binom{K}{2}$
SVMs, each of which compares a pair of classes. For example, one such
SVM might compare the kth class, coded as +1, to the k'th class, coded
as −1. We classify a test observation using each of the $\binom{K}{2}$ classifiers, and
we tally the number of times that the test observation is assigned to each
of the K classes. The final classification is performed by assigning the test
observation to the class to which it was most frequently assigned in these
$\binom{K}{2}$ pairwise classifications.
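
The voting step can be sketched as follows in R. The function pairwise_winner is a stand-in rule invented for this example; in practice each call would be replaced by the prediction of the SVM fitted to that pair of classes.

```r
# pairwise_winner(k, kp) returns the class (k or kp) chosen for one test
# observation by the classifier comparing classes k and kp.
# Here it is a made-up rule, not a fitted SVM.
K <- 4
pairwise_winner <- function(k, kp) if (3 %in% c(k, kp)) 3 else min(k, kp)

pairs <- combn(K, 2)   # all choose(K, 2) pairs of classes, one per column
winners <- apply(pairs, 2, function(pr) pairwise_winner(pr[1], pr[2]))

# Tally the votes and assign the most frequently chosen class
votes <- table(factor(winners, levels = 1:K))
predicted <- as.integer(names(votes)[which.max(votes)])
votes
predicted
```

Note that which.max() breaks ties by taking the first class with the maximal count; a real implementation would need to decide how ties are handled.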

9.4.2 One-Versus-All Classification

The one-versus-all approach is an alternative procedure for applying SVMs
in the case of K > 2 classes. We fit K SVMs, each time comparing one of
the K classes to the remaining K − 1 classes. Let $\beta_{0k}, \beta_{1k}, \ldots, \beta_{pk}$ denote
the parameters that result from fitting an SVM comparing the kth class
(coded as +1) to the others (coded as −1). Let $x^*$ denote a test observation.
We assign the observation to the class for which $\beta_{0k} + \beta_{1k} x_1^* + \beta_{2k} x_2^* + \cdots + \beta_{pk} x_p^*$
is largest, as this amounts to a high level of confidence that the test
observation belongs to the kth class rather than to any of the other classes.
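
The one-versus-all rule amounts to computing K linear scores and taking the largest. In this R sketch the coefficient matrix B is made up for illustration (K = 3 classes, p = 2 features) and does not come from any fitted model.

```r
# Row k holds (beta_0k, beta_1k, beta_2k) from the kth one-versus-all fit.
# These numbers are hypothetical.
B <- rbind(c(-1.0,  2.0,  0.5),
           c( 0.2, -1.0,  1.5),
           c( 0.5,  0.3, -2.0))
x_star <- c(1.0, 2.0)                     # test observation

# beta_0k + beta_1k * x1 + beta_2k * x2 for each class k
scores <- as.vector(B %*% c(1, x_star))
predicted <- which.max(scores)            # class with the largest score
scores
predicted
```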


9.5 Relationship to Logistic Regression 



When SVMs were first introduced in the mid-1990s, they made quite a
splash in the statistical and machine learning communities. This was due
in part to their good performance, good marketing, and also to the fact
that the underlying approach seemed both novel and mysterious. The idea
of finding a hyperplane that separates the data as well as possible, while allowing
some violations to this separation, seemed distinctly different from
classical approaches for classification, such as logistic regression and linear
discriminant analysis. Moreover, the idea of using a kernel to expand
the feature space in order to accommodate non-linear class boundaries appeared
to be a unique and valuable characteristic.

However, since that time, deep connections between SVMs and other 
more classical statistical methods have emerged. It turns out that one can 
rewrite the criterion (9.12)-(9.15) for fitting the support vector classifier 
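$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$ as

$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}} \Bigl\{ \sum_{i=1}^{n} \max\bigl[0, 1 - y_i f(x_i)\bigr] + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\}, \tag{9.25}$$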


where $\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large then $\beta_1, \ldots, \beta_p$
are small, more violations to the margin are tolerated, and a low-variance
but high-bias classifier will result. When $\lambda$ is small then few violations
to the margin will occur; this amounts to a high-variance but low-bias
classifier. Thus, a small value of $\lambda$ in (9.25) amounts to a small value of C
in (9.15). Note that the $\lambda \sum_{j=1}^{p} \beta_j^2$ term in (9.25) is the ridge penalty term
from Section 6.2.1, and plays a similar role in controlling the bias-variance
trade-off for the support vector classifier.

Now (9.25) takes the “Loss + Penalty” form that we have seen repeatedly 
throughout this book: 

$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\text{minimize}} \bigl\{ L(X, y, \beta) + \lambda P(\beta) \bigr\}. \tag{9.26}$$

In (9.26), $L(X, y, \beta)$ is some loss function quantifying the extent to which
the model, parametrized by $\beta$, fits the data (X, y), and $P(\beta)$ is a penalty
function on the parameter vector $\beta$ whose effect is controlled by a nonnegative
tuning parameter $\lambda$. For instance, ridge regression and the lasso both
take this form with

$$L(X, y, \beta) = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2$$

and with $P(\beta) = \sum_{j=1}^{p} \beta_j^2$ for ridge regression and $P(\beta) = \sum_{j=1}^{p} |\beta_j|$ for
the lasso. In the case of (9.25) the loss function instead takes the form

$$L(X, y, \beta) = \sum_{i=1}^{n} \max\bigl[0, 1 - y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\bigr].$$

This is known as hinge loss, and is depicted in Figure 9.12. However, it
turns out that the hinge loss function is closely related to the loss function
used in logistic regression, also shown in Figure 9.12.

An interesting characteristic of the support vector classifier is that only
support vectors play a role in the classifier obtained; observations on the
correct side of the margin do not affect it. This is due to the fact that the
loss function shown in Figure 9.12 is exactly zero for observations for which
$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq 1$; these correspond to observations that are
on the correct side of the margin.³ In contrast, the loss function for logistic
regression shown in Figure 9.12 is not exactly zero anywhere. But it is very
small for observations that are far from the decision boundary. Due to the
similarities between their loss functions, logistic regression and the support
vector classifier often give very similar results. When the classes are well
separated, SVMs tend to behave better than logistic regression; in more
overlapping regimes, logistic regression is often preferred.

³ With this hinge-loss + penalty representation, the margin corresponds to the value
one, and the width of the margin is determined by $\sum_{j=1}^{p} \beta_j^2$.

FIGURE 9.12. The SVM and logistic regression loss functions are compared,
as a function of $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$. When $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$ is
greater than 1, then the SVM loss is zero, since this corresponds to an observation
that is on the correct side of the margin. Overall, the two loss functions have quite
similar behavior.
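
The two loss functions are easy to compare numerically. In the R sketch below, m stands for the quantity $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$, and the logistic loss is written in its usual form for responses coded as −1 and +1; the function names are our own.

```r
# m plays the role of y_i * (beta_0 + beta_1 x_i1 + ... + beta_p x_ip)
hinge_loss    <- function(m) pmax(0, 1 - m)
logistic_loss <- function(m) log(1 + exp(-m))  # negative log-likelihood, y in {-1, +1}

m <- c(-2, 0, 1, 3)
cbind(m, hinge = hinge_loss(m), logistic = logistic_loss(m))
```

The hinge loss is exactly zero once m reaches 1, whereas the logistic loss is small but strictly positive there.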


When the support vector classifier and SVM were first introduced, it was
thought that the tuning parameter C in (9.15) was an unimportant “nuisance”
parameter that could be set to some default value, like 1. However,
the “Loss + Penalty” formulation (9.25) for the support vector classifier
indicates that this is not the case. The choice of tuning parameter is very
important and determines the extent to which the model underfits or overfits
the data, as illustrated, for example, in Figure 9.7.

We have established that the support vector classifier is closely related 
to logistic regression and other preexisting statistical methods. Is the SVM 
unique in its use of kernels to enlarge the feature space to accommodate 
non-linear class boundaries? The answer to this question is “no”. We could 
just as well perform logistic regression or many of the other classification 
methods seen in this book using non-linear kernels; this is closely related 
to some of the non-linear approaches seen in Chapter 7. However, for historical
reasons, the use of non-linear kernels is much more widespread in
the context of SVMs than in the context of logistic regression or other 
methods. 

Though we have not addressed it here, there is in fact an extension
of the SVM for regression (i.e. for a quantitative rather than a qualitative
response), called support vector regression. In Chapter 3, we saw that
least squares regression seeks coefficients $\beta_0, \beta_1, \ldots, \beta_p$ such that the sum
of squared residuals is as small as possible. (Recall from Chapter 3 that
residuals are defined as $y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}$.) Support vector
regression instead seeks coefficients that minimize a different type of loss,
where only residuals larger in absolute value than some positive constant
contribute to the loss function. This is an extension of the margin used in
support vector classifiers to the regression setting.
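
One common loss of this kind is the so-called epsilon-insensitive loss, sketched below in R. The function name, the residuals, and the value eps = 0.5 are our own choices for illustration; the text above does not commit to this exact form.

```r
# Residuals no larger than eps in absolute value contribute nothing;
# larger residuals contribute the amount by which they exceed eps.
eps_insensitive_loss <- function(resid, eps = 0.5) pmax(0, abs(resid) - eps)

resid <- c(-2, -0.3, 0, 0.4, 1.5)   # hypothetical residuals
rbind(squared = resid^2, eps_insensitive = eps_insensitive_loss(resid))
```

Unlike squared error, this loss ignores the three small residuals entirely.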


9.6 Lab: Support Vector Machines 

We use the e1071 library in R to demonstrate the support vector classifier
and the SVM. Another option is the LiblineaR library, which is useful for 
very large linear problems. 


9.6.1 Support Vector Classifier 

The e1071 library contains implementations for a number of statistical
learning methods. In particular, the svm() function can be used to fit a
support vector classifier when the argument kernel="linear" is used. This
function uses a slightly different formulation from (9.14) and (9.25) for the
support vector classifier. A cost argument allows us to specify the cost of
a violation to the margin. When the cost argument is small, then the margins
will be wide and many support vectors will be on the margin or will
violate the margin. When the cost argument is large, then the margins will
be narrow and there will be few support vectors on the margin or violating
the margin.

We now use the svm() function to fit the support vector classifier for a 
given value of the cost parameter. Here we demonstrate the use of this 
function on a two-dimensional example so that we can plot the resulting 
decision boundary. We begin by generating the observations, which belong 
to two classes. 

> set.seed(1)
> x=matrix(rnorm(20*2), ncol=2)
> y=c(rep(-1,10), rep(1,10))
> x[y==1,]=x[y==1,] + 1

We begin by checking whether the classes are linearly separable. 

> plot(x, col=(3-y)) 

They are not. Next, we fit the support vector classifier. Note that in order 
for the svm() function to perform classification (as opposed to SVM-based 
regression), we must encode the response as a factor variable. We now 
create a data frame with the response coded as a factor. 

> dat=data.frame(x=x, y=as.factor(y))
> library(e1071)
> svmfit=svm(y~., data=dat, kernel="linear", cost=10,
    scale=FALSE)

The argument scale=FALSE tells the svm() function not to scale each feature
to have mean zero or standard deviation one; depending on the application, 
one might prefer to use scale=TRUE. 

We can now plot the support vector classifier obtained: 

> plot(svmfit, dat) 

Note that the two arguments to the plot.svm() function are the output
of the call to svm(), as well as the data used in the call to svm(). The
region of feature space that will be assigned to the −1 class is shown in
light blue, and the region that will be assigned to the +1 class is shown in 
purple. The decision boundary between the two classes is linear (because we 
used the argument kernel="linear"), though due to the way in which the 
plotting function is implemented in this library the decision boundary looks 
somewhat jagged in the plot. We see that in this case only one observation 
is misclassified. (Note that here the second feature is plotted on the x-axis
and the first feature is plotted on the y-axis, in contrast to the behavior of 
the usual plot() function in R.) The support vectors are plotted as crosses 
and the remaining observations are plotted as circles; we see here that there 
are seven support vectors. We can determine their identities as follows: 

> svmfit$index
[1]  1  2  5  7 14 16 17

We can obtain some basic information about the support vector classifier 
fit using the summary() command:

> summary(svmfit)

Call:
svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10,
    scale = FALSE)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  10
      gamma:  0.5

Number of Support Vectors:  7
 ( 4 3 )

Number of Classes:  2

Levels:
 -1 1

This tells us, for instance, that a linear kernel was used with cost=10, and 
that there were seven support vectors, four in one class and three in the 
other. 

What if we instead used a smaller value of the cost parameter? 

> svmfit=svm(y~., data=dat, kernel="linear", cost=0.1,
    scale=FALSE)
> plot(svmfit, dat)
> svmfit$index
 [1]  1  2  3  4  5  7  9 10 12 13 14 15 16 17 18 20

Now that a smaller value of the cost parameter is being used, we obtain a
larger number of support vectors, because the margin is now wider. Unfortunately,
the svm() function does not explicitly output the coefficients of
the linear decision boundary obtained when the support vector classifier is
fit, nor does it output the width of the margin.
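
Although these quantities are not printed, they can be recovered from the fitted object when a linear kernel is used with scale=FALSE. The following sketch relies on the coefs, SV, and rho components of the object returned by svm(); the names w, b, and margin_width are our own, and the sign convention of the decision values depends on the ordering of the factor levels.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x = x, y = as.factor(y))
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)

# coefs holds the signed coefficients of the support vectors, SV the support
# vectors themselves, and rho the negative of the intercept.
w <- drop(t(fit$coefs) %*% fit$SV)   # coefficients of the linear boundary
b <- -fit$rho                        # intercept
margin_width <- 2 / sqrt(sum(w^2))   # distance between the lines f(x) = -1 and f(x) = +1

# Check against the decision values reported by predict()
dv <- attributes(predict(fit, dat, decision.values = TRUE))$decision.values
manual <- drop(x %*% w + b)
c(w, intercept = b, margin_width = margin_width)
```

If the features had been scaled (scale=TRUE), the recovered coefficients would refer to the scaled features rather than the original ones.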

The e1071 library includes a built-in function, tune(), to perform cross-
validation. By default, tune() performs ten-fold cross-validation on a set
of models of interest. In order to use this function, we pass in relevant 
information about the set of models that are under consideration. The 
following command indicates that we want to compare SVMs with a linear 
kernel, using a range of values of the cost parameter. 

> set.seed(1)
> tune.out=tune(svm, y~., data=dat, kernel="linear",
    ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))


We can easily access the cross-validation errors for each of these models 
using the summary() command:

> summary(tune.out)

Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
 cost
  0.1
- best performance: 0.1
- Detailed performance results:
   cost error dispersion
1 1e-03  0.70      0.422
2 1e-02  0.70      0.422
3 1e-01  0.10      0.211
4 1e+00  0.15      0.242
5 5e+00  0.15      0.242
6 1e+01  0.15      0.242
7 1e+02  0.15      0.242

We see that cost=0.1 results in the lowest cross-validation error rate. The 
tune() function stores the best model obtained, which can be accessed as
follows: 

> bestmod=tune.out$best.model 

> summary(bestmod) 

The predict() function can be used to predict the class label on a set of
test observations, at any given value of the cost parameter. We begin by 
generating a test data set. 

> xtest=matrix(rnorm(20*2), ncol=2)
> ytest=sample(c(-1,1), 20, rep=TRUE)
> xtest[ytest==1,]=xtest[ytest==1,] + 1
> testdat=data.frame(x=xtest, y=as.factor(ytest))

Now we predict the class labels of these test observations. Here we use the 
best model obtained through cross-validation in order to make predictions. 



> ypred=predict(bestmod, testdat)
> table(predict=ypred, truth=testdat$y)
       truth
predict -1  1
     -1 11  1
     1   0  8
Thus, with this value of cost, 19 of the test observations are correctly 
classified. What if we had instead used cost=0.01? 

> svmfit=svm(y~., data=dat, kernel="linear", cost=.01,
    scale=FALSE)
> ypred=predict(svmfit, testdat)
> table(predict=ypred, truth=testdat$y)
       truth
predict -1  1
     -1 11  2
     1   0  7

In this case one additional observation is misclassified. 

Now consider a situation in which the two classes are linearly separable. 
Then we can find a separating hyperplane using the svm() function. We 
first further separate the two classes in our simulated data so that they are 
linearly separable: 

> x[y==1,]=x[y==1,]+0.5

> plot(x, col=(y+5)/2, pch=19) 

Now the observations are just barely linearly separable. We fit the support 
vector classifier and plot the resulting hyperplane, using a very large value 
of cost so that no observations are misclassified. 


> dat=data.frame(x=x, y=as.factor(y))
> svmfit=svm(y~., data=dat, kernel="linear", cost=1e5)
> summary(svmfit)

Call:
svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1e+05)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  1e+05
      gamma:  0.5

Number of Support Vectors:  3
 ( 1 2 )

Number of Classes:  2

Levels:
 -1 1

> plot(svmfit, dat)

No training errors were made and only three support vectors were used. 
However, we can see from the figure that the margin is very narrow (because 
the observations that are not support vectors, indicated as circles, are very 
close to the decision boundary). It seems likely that this model will perform
poorly on test data. We now try a smaller value of cost:

> svmfit=svm(y~., data=dat, kernel="linear", cost=1)
> summary(svmfit)
> plot(svmfit, dat)

Using cost=1, we misclassify a training observation, but we also obtain
a much wider margin and make use of seven support vectors. It seems
likely that this model will perform better on test data than the model with
cost=1e5.


9.6.2 Support Vector Machine 

In order to fit an SVM using a non-linear kernel, we once again use the svm() 
function. However, now we use a different value of the parameter kernel. 
To fit an SVM with a polynomial kernel we use kernel="polynomial", and 
to fit an SVM with a radial kernel we use kernel="radial". In the former 
case we also use the degree argument to specify a degree for the polynomial 
kernel (this is d in (9.22)), and in the latter case we use gamma to specify a 
value of $\gamma$ for the radial basis kernel (9.24).

We first generate some data with a non-linear class boundary, as follows: 

> set.seed(1)
> x=matrix(rnorm(200*2), ncol=2)
> x[1:100,]=x[1:100,]+2
> x[101:150,]=x[101:150,]-2
> y=c(rep(1,150), rep(2,50))
> dat=data.frame(x=x, y=as.factor(y))

Plotting the data makes it clear that the class boundary is indeed non-linear:

> plot(x, col=y) 

The data are randomly split into training and testing groups. We then fit
the training data using the svm() function with a radial kernel and $\gamma = 1$:

> train=sample(200,100)
> svmfit=svm(y~., data=dat[train,], kernel="radial", gamma=1,
    cost=1)
> plot(svmfit, dat[train,])

The plot shows that the resulting SVM has a decidedly non-linear 
boundary. The summary() function can be used to obtain some
information about the SVM fit:

> summary(svmfit)

Call:
svm(formula = y ~ ., data = dat, kernel = "radial",
    gamma = 1, cost = 1)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  1

Number of Support Vectors:  37
 ( 17 20 )

Number of Classes:  2

Levels:
 1 2

We can see from the figure that there are a fair number of training errors 
in this SVM fit. If we increase the value of cost, we can reduce the number 
of training errors. However, this comes at the price of a more irregular 
decision boundary that seems to be at risk of overfitting the data. 

> svmfit=svm(y~., data=dat[train,], kernel="radial", gamma=1,
    cost=1e5)
> plot(svmfit, dat[train,])


We can perform cross-validation using tune() to select the best choice of
$\gamma$ and cost for an SVM with a radial kernel:

> set.seed(1)
> tune.out=tune(svm, y~., data=dat[train,], kernel="radial",
    ranges=list(cost=c(0.1,1,10,100,1000),
                gamma=c(0.5,1,2,3,4)))
> summary(tune.out)

Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
 cost gamma
    1     2
- best performance: 0.12
- Detailed performance results:
   cost gamma error dispersion
1 1e-01   0.5  0.27     0.1160
2 1e+00   0.5           0.0823
3 1e+01   0.5  0.15     0.0707
4 1e+02   0.5  0.17     0.0823
5 1e+03   0.5  0.21     0.0994
6 1e-01   1.0  0.25     0.1354
7 1e+00   1.0  0.13     0.0823


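Therefore, the best choice of parameters involves cost=1 and gamma=2. We
can view the test set predictions for this model by applying the predict()
function to the data. Notice that to do this we subset the dataframe dat
using -train as an index set, and pass it through the newdata argument.

> table(true=dat[-train,"y"], pred=predict(tune.out$best.model,
    newdata=dat[-train,]))

The proportion of test observations misclassified by this SVM is the sum of
the off-diagonal entries of this table divided by 100. (If the test data are
passed under a name other than newdata, such as newx, predict() ignores them
and returns the fitted values for the training observations, which yields a
misleading error rate such as 39 %.)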

9.6.3 ROC Curves

The ROCR package can be used to produce ROC curves such as those in 
Figures 9.10 and 9.11. We first write a short function to plot an ROC curve 
given a vector containing a numerical score for each observation, pred, and 
a vector containing the class label for each observation, truth. 

> library(ROCR)
> rocplot=function(pred, truth, ...){
+   predob=prediction(pred, truth)
+   perf=performance(predob, "tpr", "fpr")
+   plot(perf, ...)}

SVMs and support vector classifiers output class labels for each observation.
However, it is also possible to obtain fitted values for each observation,
which are the numerical scores used to obtain the class labels. For instance,
in the case of a support vector classifier, the fitted value for an observation
$X = (X_1, X_2, \ldots, X_p)^T$ takes the form $\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_p X_p$.
For an SVM with a non-linear kernel, the equation that yields the fitted
value is given in (9.23). In essence, the sign of the fitted value determines
on which side of the decision boundary the observation lies. Therefore, the
relationship between the fitted value and the class prediction for a given
observation is simple: if the fitted value exceeds zero then the observation
is assigned to one class, and if it is less than zero then it is assigned to the
other. In order to obtain the fitted values for a given SVM model fit, we
use decision.values=TRUE when fitting svm(). Then the predict() function
will output the fitted values.

> svmfit.opt=svm(y~., data=dat[train,], kernel="radial",
    gamma=2, cost=1, decision.values=T)
> fitted=attributes(predict(svmfit.opt, dat[train,],
    decision.values=TRUE))$decision.values

Now we can produce the ROC plot.

> par(mfrow=c(1,2))
> rocplot(fitted, dat[train,"y"], main="Training Data")

The SVM appears to be producing accurate predictions. By increasing $\gamma$ we can
produce a more flexible fit and generate further improvements in accuracy.

> svmfit.flex=svm(y~., data=dat[train,], kernel="radial",
    gamma=50, cost=1, decision.values=T)
> fitted=attributes(predict(svmfit.flex, dat[train,],
    decision.values=T))$decision.values
> rocplot(fitted, dat[train,"y"], add=T, col="red")

However, these ROC curves are all on the training data. We are really 
more interested in the level of prediction accuracy on the test data. When 
we compute the ROC curves on the test data, the model with $\gamma = 2$ appears
to provide the most accurate results. 

> fitted=attributes(predict(svmfit.opt, dat[-train,],
    decision.values=T))$decision.values
> rocplot(fitted, dat[-train,"y"], main="Test Data")
> fitted=attributes(predict(svmfit.flex, dat[-train,],
    decision.values=T))$decision.values
> rocplot(fitted, dat[-train,"y"], add=T, col="red")


9.6.4 SVM with Multiple Classes 

If the response is a factor containing more than two levels, then the svm()
function will perform multi-class classification using the one-versus-one approach.
We explore that setting here by generating a third class of observations.

> set.seed(1)
> x=rbind(x, matrix(rnorm(50*2), ncol=2))
> y=c(y, rep(0,50))
> x[y==0,2]=x[y==0,2]+2
> dat=data.frame(x=x, y=as.factor(y))
> par(mfrow=c(1,1))
> plot(x, col=(y+1))

We now fit an SVM to the data: 

> svmfit = svm(y~. , data = dat, kernel = "radial", cost=10, gamma = l) 

> plot(svmfit, dat) 

The e1071 library can also be used to perform support vector regression, 
if the response vector that is passed in to svm() is numerical rather than a 
factor. 
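
As a minimal sketch of that last point, the code below fits svm() to a 
numeric response. The toy data (a noisy sine curve) and the names x.num, 
y.num, dat.reg and svr.fit are made up for illustration and are not part of 
the lab; the cost and gamma values are arbitrary. 

```r
library(e1071)

set.seed(1)
# Hypothetical toy data: a noisy sine curve with a numeric response
x.num <- seq(0, 2 * pi, length.out = 100)
y.num <- sin(x.num) + rnorm(100, sd = 0.2)
dat.reg <- data.frame(x = x.num, y = y.num)

# y is numeric rather than a factor, so svm() fits a regression model
svr.fit <- svm(y ~ x, data = dat.reg, kernel = "radial", cost = 10, gamma = 1)
summary(svr.fit)   # the SVM-Type line should report a regression type

# Predictions are numeric, so we summarize the fit with the training MSE
pred.reg <- predict(svr.fit, dat.reg)
mean((pred.reg - dat.reg$y)^2)
```

As with classification, the training error alone says little about how well 
the fit generalizes; cost and gamma would normally be chosen with tune(). 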


9.6.5 Application to Gene Expression Data 

We now examine the Khan data set, which consists of a number of tissue 
samples corresponding to four distinct types of small round blue cell 
tumors. For each tissue sample, gene expression measurements are available. 
The data set consists of training data, xtrain and ytrain, and testing data, 
xtest and ytest. 

We examine the dimension of the data: 

> library(ISLR) 
> names(Khan) 
[1] "xtrain" "xtest"  "ytrain" "ytest" 
> dim(Khan$xtrain) 
[1]   63 2308 
> dim(Khan$xtest) 
[1]   20 2308 
> length(Khan$ytrain) 
[1] 63 
> length(Khan$ytest) 
[1] 20 

This data set consists of expression measurements for 2,308 genes. 
The training and test sets consist of 63 and 20 observations respectively. 

> table(Khan$ytrain) 

 1  2  3  4 
 8 23 12 20 
> table(Khan$ytest) 

1 2 3 4 
3 6 6 5 

We will use a support vector approach to predict cancer subtype using gene 
expression measurements. In this data set, there are a very large number 
of features relative to the number of observations. This suggests that we 
should use a linear kernel, because the additional flexibility that will result 
from using a polynomial or radial kernel is unnecessary. 

> dat=data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain)) 
> out=svm(y~., data=dat, kernel="linear", cost=10) 
> summary(out) 
Call: 
svm(formula = y ~ ., data = dat, kernel = "linear", 
    cost = 10) 
Parameters: 
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 
      gamma:  0.000433 
Number of Support Vectors:  58 
 ( 20 20 11 7 ) 
Number of Classes:  4 
Levels: 
 1 2 3 4 
> table(out$fitted, dat$y) 

     1  2  3  4 
  1  8  0  0  0 
  2  0 23  0  0 
  3  0  0 12  0 
  4  0  0  0 20 

We see that there are no training errors. In fact, this is not surprising, 
because the large number of variables relative to the number of observations 
implies that it is easy to find hyperplanes that fully separate the classes. We 
are most interested not in the support vector classifier’s performance on the 
training observations, but rather its performance on the test observations. 

> dat.te=data.frame(x=Khan$xtest, y=as.factor(Khan$ytest)) 
> pred.te=predict(out, newdata=dat.te) 
> table(pred.te, dat.te$y) 

pred.te 1 2 3 4 
      1 3 0 0 0 
      2 0 6 2 0 
      3 0 0 4 0 
      4 0 0 0 5 

We see that using cost=10 yields two test set errors on this data. 


9.7 Exercises 

Conceptual 

1. This problem involves hyperplanes in two dimensions. 

(a) Sketch the hyperplane $1 + 3X_1 - X_2 = 0$. Indicate the set of 
points for which $1 + 3X_1 - X_2 > 0$, as well as the set of points 
for which $1 + 3X_1 - X_2 < 0$. 

(b) On the same plot, sketch the hyperplane $-2 + X_1 + 2X_2 = 0$. 
Indicate the set of points for which $-2 + X_1 + 2X_2 > 0$, as well 
as the set of points for which $-2 + X_1 + 2X_2 < 0$. 

2. We have seen that in p = 2 dimensions, a linear decision boundary 
takes the form $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. We now 
investigate a non-linear decision boundary. 

(a) Sketch the curve 

$(1 + X_1)^2 + (2 - X_2)^2 = 4.$ 

(b) On your sketch, indicate the set of points for which 

$(1 + X_1)^2 + (2 - X_2)^2 > 4,$ 

as well as the set of points for which 

$(1 + X_1)^2 + (2 - X_2)^2 \le 4.$ 

(c) Suppose that a classifier assigns an observation to the blue class 
if 

$(1 + X_1)^2 + (2 - X_2)^2 > 4,$ 

and to the red class otherwise. To what class is the observation 
(0,0) classified? (-1,1)? (2,2)? (3,8)? 

(d) Argue that while the decision boundary in (c) is not linear in 
terms of $X_1$ and $X_2$, it is linear in terms of $X_1$, $X_1^2$, $X_2$, 
and $X_2^2$. 

3. Here we explore the maximal margin classifier on a toy data set. 

(a) We are given n = 7 observations in p = 2 dimensions. For each 
observation, there is an associated class label. 

Obs.   X_1   X_2   Y 
1      3     4     Red 
2      2     2     Red 
3      4     4     Red 
4      1     4     Red 
5      2     1     Blue 
6      4     3     Blue 
7      4     1     Blue 

Sketch the observations. 

(b) Sketch the optimal separating hyperplane, and provide the 
equation for this hyperplane (of the form (9.1)). 

(c) Describe the classification rule for the maximal margin classifier. 
It should be something along the lines of “Classify to Red if 
$\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0$, and classify to Blue otherwise.” 
Provide the values for $\beta_0$, $\beta_1$, and $\beta_2$. 

(d) On your sketch, indicate the margin for the maximal margin 
hyperplane. 

(e) Indicate the support vectors for the maximal margin classifier. 

(f) Argue that a slight movement of the seventh observation would 
not affect the maximal margin hyperplane. 

(g) Sketch a hyperplane that is not the optimal separating 
hyperplane, and provide the equation for this hyperplane. 

(h) Draw an additional observation on the plot so that the two 
classes are no longer separable by a hyperplane. 

Applied 

4. Generate a simulated two-class data set with 100 observations and 
two features in which there is a visible but non-linear separation 
between the two classes. Show that in this setting, a support vector 
machine with a polynomial kernel (with degree greater than 1) or a 
radial kernel will outperform a support vector classifier on the 
training data. Which technique performs best on the test data? Make 
plots and report training and test error rates in order to back up 
your assertions. 

5. We have seen that we can fit an SVM with a non-linear kernel in order 
to perform classification using a non-linear decision boundary. We will 
now see that we can also obtain a non-linear decision boundary by 
performing logistic regression using non-linear transformations of the 
features. 

(a) Generate a data set with n = 500 and p = 2, such that the 
observations belong to two classes with a quadratic decision boundary 
between them. For instance, you can do this as follows: 

> x1=runif(500)-0.5 
> x2=runif(500)-0.5 
> y=1*(x1^2-x2^2 > 0) 

(b) Plot the observations, colored according to their class labels. 
Your plot should display $X_1$ on the x-axis, and $X_2$ on the 
y-axis. 

(c) Fit a logistic regression model to the data, using $X_1$ and $X_2$ as 
predictors. 

(d) Apply this model to the training data in order to obtain a 
predicted class label for each training observation. Plot the 
observations, colored according to the predicted class labels. The 
decision boundary should be linear. 

(e) Now fit a logistic regression model to the data using non-linear 
functions of $X_1$ and $X_2$ as predictors (e.g. $X_1^2$, $X_1 \times X_2$, 
$\log(X_2)$, and so forth). 

(f) Apply this model to the training data in order to obtain a 
predicted class label for each training observation. Plot the 
observations, colored according to the predicted class labels. The 
decision boundary should be obviously non-linear. If it is not, 
then repeat (a)-(e) until you come up with an example in which 
the predicted class labels are obviously non-linear. 

(g) Fit a support vector classifier to the data with $X_1$ and $X_2$ as 
predictors. Obtain a class prediction for each training 
observation. Plot the observations, colored according to the predicted 
class labels. 

(h) Fit an SVM using a non-linear kernel to the data. Obtain a class 
prediction for each training observation. Plot the observations, 
colored according to the predicted class labels. 

(i) Comment on your results. 

6. At the end of Section 9.6.1, it is claimed that in the case of data that 
is just barely linearly separable, a support vector classifier with a 
small value of cost that misclassifies a couple of training observations 
may perform better on test data than one with a huge value of cost 
that does not misclassify any training observations. You will now 
investigate this claim. 

(a) Generate two-class data with p = 2 in such a way that the classes 
are just barely linearly separable. 

(b) Compute the cross-validation error rates for support vector 
classifiers with a range of cost values. How many training 
observations are misclassified for each value of cost considered, and 
how does this relate to the cross-validation errors obtained? 

(c) Generate an appropriate test data set, and compute the test 
errors corresponding to each of the values of cost considered. 
Which value of cost leads to the fewest test errors, and how 
does this compare to the values of cost that yield the fewest 
training errors and the fewest cross-validation errors? 

(d) Discuss your results. 

7. In this problem, you will use support vector approaches in order to 
predict whether a given car gets high or low gas mileage based on the 
Auto data set. 

(a) Create a binary variable that takes on a 1 for cars with gas 
mileage above the median, and a 0 for cars with gas mileage 
below the median. 

(b) Fit a support vector classifier to the data with various values 
of cost, in order to predict whether a car gets high or low gas 
mileage. Report the cross-validation errors associated with 
different values of this parameter. Comment on your results. 

(c) Now repeat (b), this time using SVMs with radial and polynomial 
basis kernels, with different values of gamma and degree and 
cost. Comment on your results. 

(d) Make some plots to back up your assertions in (b) and (c). 

Hint: In the lab, we used the plot() function for svm objects 
only in cases with p = 2. When p > 2, you can use the plot() 
function to create plots displaying pairs of variables at a time. 
Essentially, instead of typing 

> plot(svmfit, dat) 

where svmfit contains your fitted model and dat is a data frame 
containing your data, you can type 

> plot(svmfit, dat, x1~x4) 

in order to plot just the first and fourth variables. However, you 
must replace x1 and x4 with the correct variable names. To find 
out more, type ?plot.svm. 
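
To make the hint concrete, here is a small sketch on the built-in iris data 
(chosen only for illustration; it is not the Auto data that the exercise asks 
about). With p = 4 predictors, the formula picks the two variables to 
display, and the slice argument of plot.svm fixes the remaining predictors 
at chosen values. The slice values below are arbitrary. 

```r
library(e1071)

# iris has four numeric predictors, so a 2-D plot must pick two of them
fit.iris <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)

# Show Petal.Width against Petal.Length; the other two predictors are
# held fixed at the (arbitrary) values supplied in slice
plot(fit.iris, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
```

The decision regions drawn this way are a cross-section of the full 
four-dimensional classifier, so they can change noticeably when the slice 
values change. 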

8. This problem involves the OJ data set which is part of the ISLR 
package. 

(a) Create a training set containing a random sample of 800 
observations, and a test set containing the remaining 
observations. 

(b) Fit a support vector classifier to the training data using 
cost=0.01, with Purchase as the response and the other variables 
as predictors. Use the summary() function to produce summary 
statistics, and describe the results obtained. 

(c) What are the training and test error rates? 

(d) Use the tune() function to select an optimal cost. Consider 
values in the range 0.01 to 10. 

(e) Compute the training and test error rates using this new value 
for cost. 

(f) Repeat parts (b) through (e) using a support vector machine 
with a radial kernel. Use the default value for gamma. 

(g) Repeat parts (b) through (e) using a support vector machine 
with a polynomial kernel. Set degree=2. 

(h) Overall, which approach seems to give the best results on this 
data? 


Unsupervised Learning 


Most of this book concerns supervised learning methods such as 
regression and classification. In the supervised learning setting, we typically 
have access to a set of p features $X_1, X_2, \ldots, X_p$, measured on n 
observations, and a response Y also measured on those same n observations. 
The goal is then to predict Y using $X_1, X_2, \ldots, X_p$. 

This chapter will instead focus on unsupervised learning, a set of 
statistical tools intended for the setting in which we have only a set of 
features $X_1, X_2, \ldots, X_p$ measured on n observations. We are not 
interested in prediction, because we do not have an associated response 
variable Y. Rather, the goal is to discover interesting things about the 
measurements on $X_1, X_2, \ldots, X_p$. Is there an informative way to 
visualize the data? Can we discover subgroups among the variables or among 
the observations? Unsupervised learning refers to a diverse set of techniques 
for answering questions such as these. In this chapter, we will focus on two 
particular types of unsupervised learning: principal components analysis, a 
tool used for data visualization or data pre-processing before supervised 
techniques are applied, and clustering, a broad class of methods for 
discovering unknown subgroups in data. 


10.1 The Challenge of Unsupervised Learning 

Supervised learning is a well-understood area. In fact, if you have read 
the preceding chapters in this book, then you should by now have a good 
grasp of supervised learning. For instance, if you are asked to predict a 
binary outcome from a data set, you have a very well developed set of tools 
at your disposal (such as logistic regression, linear discriminant analysis, 
classification trees, support vector machines, and more) as well as a clear 
understanding of how to assess the quality of the results obtained (using 
cross-validation, validation on an independent test set, and so forth). 

In contrast, unsupervised learning is often much more challenging. The 
exercise tends to be more subjective, and there is no simple goal for the 
analysis, such as prediction of a response. Unsupervised learning is often 
performed as part of an exploratory data analysis. Furthermore, it can be 
hard to assess the results obtained from unsupervised learning methods, 
since there is no universally accepted mechanism for performing 
cross-validation or validating results on an independent data set. The reason 
for this difference is simple. If we fit a predictive model using a supervised 
learning technique, then it is possible to check our work by seeing how 
well our model predicts the response Y on observations not used in fitting 
the model. However, in unsupervised learning, there is no way to check our 
work because we don’t know the true answer: the problem is unsupervised. 

Techniques for unsupervised learning are of growing importance in a 
number of fields. A cancer researcher might assay gene expression levels in 
100 patients with breast cancer. He or she might then look for subgroups 
among the breast cancer samples, or among the genes, in order to obtain 
a better understanding of the disease. An online shopping site might try 
to identify groups of shoppers with similar browsing and purchase 
histories, as well as items that are of particular interest to the shoppers within 
each group. Then an individual shopper can be preferentially shown the 
items in which he or she is particularly likely to be interested, based on 
the purchase histories of similar shoppers. A search engine might choose 
what search results to display to a particular individual based on the click 
histories of other individuals with similar search patterns. These statistical 
learning tasks, and many more, can be performed via unsupervised learning 
techniques. 


10.2 Principal Components Analysis 

Principal components are discussed in Section 6.3.1 in the context of 
principal components regression. When faced with a large set of 
correlated variables, principal components allow us to summarize this set with 
a smaller number of representative variables that collectively explain most 
of the variability in the original set. The principal component directions 
are presented in Section 6.3.1 as directions in feature space along which 
the original data are highly variable. These directions also define lines and 
subspaces that are as close as possible to the data cloud. To perform 
principal components regression, we simply use principal components as 
predictors in a regression model in place of the original larger set of 
variables. 

Principal component analysis (PCA) refers to the process by which 
principal components are computed, and the subsequent use of these 
components in understanding the data. PCA is an unsupervised approach, since 
it involves only a set of features $X_1, X_2, \ldots, X_p$, and no associated 
response Y. Apart from producing derived variables for use in supervised 
learning problems, PCA also serves as a tool for data visualization 
(visualization of the observations or visualization of the variables). We now 
discuss PCA in greater detail, focusing on the use of PCA as a tool for 
unsupervised data exploration, in keeping with the topic of this chapter. 

10.2.1 What Are Principal Components? 

Suppose that we wish to visualize n observations with measurements on a 
set of p features, $X_1, X_2, \ldots, X_p$, as part of an exploratory data 
analysis. We could do this by examining two-dimensional scatterplots of the 
data, each of which contains the n observations’ measurements on two of the 
features. However, there are $\binom{p}{2} = p(p-1)/2$ such scatterplots; for 
example, with p = 10 there are 45 plots! If p is large, then it will certainly 
not be possible to look at all of them; moreover, most likely none of them 
will be informative since they each contain just a small fraction of the total 
information present in the data set. Clearly, a better method is required to 
visualize the n observations when p is large. In particular, we would like to 
find a low-dimensional representation of the data that captures as much of 
the information as possible. For instance, if we can obtain a two-dimensional 
representation of the data that captures most of the information, then we 
can plot the observations in this low-dimensional space. 
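
The count of pairwise scatterplots is easy to check in R. The short sketch 
below verifies the p = 10 figure quoted above and then, purely as an 
illustration, draws all pairs for the four-variable USArrests data that is 
used later in this chapter. 

```r
p <- 10
choose(p, 2)       # number of distinct pairs of features: 45
p * (p - 1) / 2    # the same count from the closed form

# With only four variables there are choose(4, 2) = 6 distinct panels;
# pairs() draws each of them twice, mirrored across the diagonal
choose(ncol(USArrests), 2)
pairs(USArrests)
```

Even at p = 4 the display is busy; at p = 100 there would be 4,950 panels, 
which is the practical motivation for a low-dimensional summary. 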

PCA provides a tool to do just this. It finds a low-dimensional 
representation of a data set that contains as much as possible of the 
variation. The idea is that each of the n observations lives in p-dimensional 
space, but not all of these dimensions are equally interesting. PCA seeks a 
small number of dimensions that are as interesting as possible, where the 
concept of interesting is measured by the amount that the observations vary 
along each 
dimension. Each of the dimensions found by PCA is a linear combination 
of the p features. We now explain the manner in which these dimensions, 
or principal components , are found. 

The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is 
the normalized linear combination of the features 

$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$   (10.1) 

that has the largest variance. By normalized, we mean that 
$\sum_{j=1}^{p} \phi_{j1}^2 = 1$. We refer to the elements 
$\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal 
component; together, the loadings make up the principal component loading 
vector, $\phi_1 = (\phi_{11}\ \phi_{21}\ \ldots\ \phi_{p1})^T$. We constrain 
the loadings so that their sum of squares is equal to one, since otherwise 
setting these elements to be arbitrarily large in absolute value could result 
in an arbitrarily large variance. 
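
The need for the constraint can be spelled out in one line. In the sketch 
below, c is an arbitrary constant introduced here only for illustration: 
multiplying every loading by c multiplies the variance of the combination by 
$c^2$, so without the constraint the variance has no maximum, while the 
sum-of-squares constraint rules out every rescaling except $c = \pm 1$. 

```latex
\mathrm{Var}\!\left(\sum_{j=1}^{p} c\,\phi_{j1} X_j\right)
  = c^{2}\,\mathrm{Var}\!\left(\sum_{j=1}^{p} \phi_{j1} X_j\right)
  = c^{2}\,\mathrm{Var}(Z_1),
\qquad
\sum_{j=1}^{p} \left(c\,\phi_{j1}\right)^{2}
  = c^{2} \sum_{j=1}^{p} \phi_{j1}^{2}
  = c^{2}.
```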

Given an $n \times p$ data set $\mathbf{X}$, how do we compute the first 
principal component? Since we are only interested in variance, we assume 
that each of the variables in $\mathbf{X}$ has been centered to have mean 
zero (that is, the column means of $\mathbf{X}$ are zero). We then look for 
the linear combination of the sample feature values of the form 

$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$   (10.2) 

that has largest sample variance, subject to the constraint that 
$\sum_{j=1}^{p} \phi_{j1}^2 = 1$. In other words, the first principal 
component loading vector solves the optimization problem 

$\underset{\phi_{11}, \ldots, \phi_{p1}}{\text{maximize}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \right\}$ subject to $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.   (10.3) 

From (10.2) we can write the objective in (10.3) as 
$\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$. Since 
$\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0$, the average of the 
$z_{11}, \ldots, z_{n1}$ will be zero as well. Hence the objective that we 
are maximizing in (10.3) is just the sample variance of the n values of 
$z_{i1}$. We refer to $z_{11}, \ldots, z_{n1}$ as the scores of the first 
principal component. Problem (10.3) can be solved via an eigen decomposition, 
a standard technique in linear algebra, but details are outside of the scope 
of this book. 

There is a nice geometric interpretation for the first principal component. 
The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, 
\phi_{p1}$ defines a direction in feature space along which the data vary the 
most. If we project the n data points $x_1, \ldots, x_n$ onto this direction, 
the projected values are the principal component scores 
$z_{11}, \ldots, z_{n1}$ themselves. For instance, Figure 6.14 on 
page 230 displays the first principal component loading vector (green solid 
line) on an advertising data set. In these data, there are only two features, 
and so the observations as well as the first principal component loading 
vector can be easily displayed. As can be seen from (6.19), in that data set 
$\phi_{11} = 0.839$ and $\phi_{21} = 0.544$. 

After the first principal component $Z_1$ of the features has been 
determined, we can find the second principal component $Z_2$. The second 
principal component is the linear combination of $X_1, \ldots, X_p$ that has 
maximal variance out of all linear combinations that are uncorrelated with 
$Z_1$. The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ 
take the form 

$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip}$,   (10.4) 

where $\phi_2$ is the second principal component loading vector, with elements 
$\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$. It turns out that constraining 
$Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the 
direction $\phi_2$ to be orthogonal (perpendicular) to the direction 
$\phi_1$. In the example in Figure 6.14, the observations lie in 
two-dimensional space (since p = 2), and so once we have found $\phi_1$, 
there is only one possibility for $\phi_2$, which is shown as a blue dashed 
line. (From Section 6.3.1, we know that $\phi_{12} = 0.544$ and 
$\phi_{22} = -0.839$.) But in a larger data set with p > 2 variables, there 
are multiple distinct principal components, and they are defined in a similar 
manner. To find $\phi_2$, we solve a problem similar to (10.3) with $\phi_2$ 
replacing $\phi_1$, and with the additional constraint that $\phi_2$ is 
orthogonal to $\phi_1$.¹ 

              PC1          PC2 
Murder      0.5358995   -0.4181809 
Assault     0.5831836   -0.1879856 
UrbanPop    0.2781909    0.8728062 
Rape        0.5434321    0.1673186 

TABLE 10.1. The principal component loading vectors, $\phi_1$ and $\phi_2$, 
for the USArrests data. These are also displayed in Figure 10.1. 
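
Although the details of the eigen decomposition are outside the scope of 
the text, a brief sketch in R may make the definitions above more concrete. 
It uses the built-in USArrests data and base R's eigen() and prcomp(); the 
object names (X, eig, phi, Z, pr.out) are chosen here for illustration. The 
covariance matrix of the centered data is $\mathbf{X}^T\mathbf{X}/(n-1)$, 
which has the same eigenvectors as $\mathbf{X}^T\mathbf{X}$. 

```r
# Center each variable and scale it to standard deviation one
X <- scale(USArrests)

# Eigen decomposition of the sample covariance matrix of X
eig <- eigen(cov(X))
phi <- eig$vectors[, 1:2]          # first two loading vectors
rownames(phi) <- colnames(X)

colSums(phi^2)                     # each loading vector has sum of squares 1
sum(phi[, 1] * phi[, 2])           # orthogonality: essentially zero

# Scores are the linear combinations in (10.2) and (10.4)
Z <- X %*% phi
apply(Z, 2, var)                   # sample variances of the scores ...
eig$values[1:2]                    # ... match the two largest eigenvalues
cor(Z[, 1], Z[, 2])                # uncorrelated scores: essentially zero

# prcomp() should agree up to the sign of each loading vector
pr.out <- prcomp(USArrests, scale. = TRUE)
max(abs(abs(phi) - abs(pr.out$rotation[, 1:2])))   # essentially zero
```

The comparison is made on absolute values because the sign of each loading 
vector is arbitrary, so the signs printed by R need not match Table 10.1. 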

Once we have computed the principal components, we can plot them 
against each other in order to produce low-dimensional views of the data. 
For instance, we can plot the score vector $Z_1$ against $Z_2$, $Z_1$ against 
$Z_3$, $Z_2$ against $Z_3$, and so forth. Geometrically, this amounts to 
projecting the original data down onto the subspace spanned by $\phi_1$, 
$\phi_2$, and $\phi_3$, and plotting the projected points. 

We illustrate the use of PCA on the USArrests data set. For each of the 
50 states in the United States, the data set contains the number of arrests 
per 100,000 residents for each of three crimes: Assault, Murder, and Rape. 
We also record UrbanPop (the percent of the population in each state living 
in urban areas). The principal component score vectors have length n = 50, 
and the principal component loading vectors have length p = 4. PCA was 
performed after standardizing each variable to have mean zero and standard 
deviation one. Figure 10.1 plots the first two principal components of these 
data. The figure represents both the principal component scores and the 
loading vectors in a single biplot display. The loadings are also given in 
Table 10.1. 
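
A display of this kind can be sketched with base R's prcomp() and biplot() 
functions. This is an illustration rather than the exact code behind 
Figure 10.1; in particular, the plot may appear mirrored relative to the 
figure, because the sign of each component is arbitrary. 

```r
# PCA on the standardized USArrests variables
pr.out <- prcomp(USArrests, scale. = TRUE)

dim(pr.out$x)          # score vectors: 50 states by 4 components
dim(pr.out$rotation)   # loading vectors: 4 variables by 4 components

# Scores (state names) and loadings (arrows) in one display;
# scale = 0 draws the arrows on the scale of the loadings
biplot(pr.out, scale = 0)
```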

FIGURE 10.1. The first two principal components for the USArrests data. The 
blue state names represent the scores for the first two principal components. 
The orange arrows indicate the first two principal component loading vectors 
(with axes on the top and right). For example, the loading for Rape on the 
first component is 0.54, and its loading on the second principal component is 
0.17 (the word Rape is centered at the point (0.54, 0.17)). This figure is 
known as a biplot, because it displays both the principal component scores 
and the principal component loadings. 

In Figure 10.1, we see that the first loading vector places approximately 
equal weight on Assault, Murder, and Rape, with much less weight on 
UrbanPop. Hence this component roughly corresponds to a measure of overall 
rates of serious crimes. The second loading vector places most of its weight 
on UrbanPop and much less weight on the other three features. Hence, this 
component roughly corresponds to the level of urbanization of the state. 
Overall, we see that the crime-related variables (Murder, Assault, and Rape) 
are located close to each other, and that the UrbanPop variable is far from 
the other three. This indicates that the crime-related variables are 
correlated with each other (states with high murder rates tend to have high 
assault and rape rates) and that the UrbanPop variable is less correlated 
with the other three. 

We can examine differences between the states via the two principal 
component score vectors shown in Figure 10.1. Our discussion of the loading 
vectors suggests that states with large positive scores on the first 
component, such as California, Nevada and Florida, have high crime rates, 
while states like North Dakota, with negative scores on the first component, 
have low crime rates. California also has a high score on the second 
component, indicating a high level of urbanization, while the opposite is 
true for states like Mississippi. States close to zero on both components, 
such as Indiana, have approximately average levels of both crime and 
urbanization. 

¹On a technical note, the principal component directions $\phi_1, \phi_2, 
\phi_3, \ldots$ are the ordered sequence of eigenvectors of the matrix 
$\mathbf{X}^T\mathbf{X}$, and the variances of the components are the 
eigenvalues. There are at most $\min(n-1, p)$ principal components. 

10.2.2 Another Interpretation of Principal Components 

The first two principal component loading vectors in a simulated 
three-dimensional data set are shown in the left-hand panel of Figure 10.2; 
these two loading vectors span a plane along which the observations have the 
highest variance. 

In the previous section, we describe the principal component loading 
vectors as the directions in feature space along which the data vary the most, 
and the principal component scores as projections along these directions. 
However, an alternative interpretation for principal components can also be 
useful: principal components provide low-dimensional linear surfaces that 
are closest to the observations. We expand upon that interpretation here. 

The first principal component loading vector has a very special property: 
it is the line in p-dimensional space that is closest to the n observations 
(using average squared Euclidean distance as a measure of closeness). This 
interpretation can be seen in the left-hand panel of Figure 6.15; the dashed 
lines indicate the distance between each observation and the first principal 
component loading vector. The appeal of this interpretation is clear: we 
seek a single dimension of the data that lies as close as possible to all of 
the data points, since such a line will likely provide a good summary of the 
data. 

The notion of principal components as the dimensions that are closest 
to the n observations extends beyond just the first principal component. 
For instance, the first two principal components of a data set 
span the plane that is closest to the n observations, in terms of average 
squared Euclidean distance. An example is shown in the left-hand panel 
of Figure 10.2. The first three principal components of a data set span 
the three-dimensional hyperplane that is closest to the n observations, and 
so forth. 
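
The closest-line property can be checked numerically. The sketch below 
simulates 90 centered observations in three dimensions (a made-up data set, 
not the one shown in Figure 10.2) and compares the average squared distance 
to the first principal component direction with the same quantity for 1,000 
random directions. The helper avg.sq.dist is hypothetical and defined here 
only for this illustration. 

```r
set.seed(1)
# Hypothetical data: 90 observations with unequal spread in 3 dimensions
X <- matrix(rnorm(90 * 3), ncol = 3) %*% diag(c(3, 1, 0.3))
X <- scale(X, center = TRUE, scale = FALSE)   # column-center the data

# Average squared Euclidean distance from the rows of X to the line
# through the origin with direction v
avg.sq.dist <- function(v) {
  v <- v / sqrt(sum(v^2))           # normalize to unit length
  proj <- (X %*% v) %*% t(v)        # projection of each row onto the line
  mean(rowSums((X - proj)^2))
}

phi1 <- prcomp(X)$rotation[, 1]     # first principal component direction
d.pc <- avg.sq.dist(phi1)
d.rand <- replicate(1000, avg.sq.dist(rnorm(3)))

d.pc
min(d.rand)
all(d.rand >= d.pc - 1e-12)         # no random direction should do better
```

A random search is not a proof, but it illustrates that maximizing the 
variance of the projections and minimizing the distance to the line pick out 
the same direction. 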

FIGURE 10.2. Ninety observations simulated in three dimensions. Left: the 
first two principal component directions span the plane that best fits the 
data. It minimizes the sum of squared distances from each point to the plane. 
Right: the first two principal component score vectors give the coordinates 
of the projection of the 90 observations onto the plane. The variance in the 
plane is maximized. 

Using this interpretation, together the first M principal component score 
vectors and the first M principal component loading vectors provide the 
best M-dimensional approximation (in terms of Euclidean distance) to 
the ith observation $x_{ij}$. This representation can be written 

$x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}$   (10.5) 

(assuming the original data matrix $\mathbf{X}$ is column-centered). In other 
words, together the M principal component score vectors and M principal 
component loading vectors can give a good approximation to the data when 
M is sufficiently large. When $M = \min(n-1, p)$, then the representation 
is exact: $x_{ij} = \sum_{m=1}^{M} z_{im} \phi_{jm}$. 
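
The approximation in (10.5) is straightforward to try out. The following 
sketch, which uses the standardized USArrests data and prcomp() as an 
assumed stand-in for the computations in the text, rebuilds the data matrix 
from the first M score and loading vectors and reports the mean squared 
reconstruction error for M = 1, ..., 4. 

```r
X <- scale(USArrests)     # column-centered (and scaled) data matrix
pr.out <- prcomp(X)

# Mean squared error of the M-component approximation in (10.5)
recon.err <- sapply(1:4, function(M) {
  Z   <- pr.out$x[, 1:M, drop = FALSE]          # scores z_im
  Phi <- pr.out$rotation[, 1:M, drop = FALSE]   # loadings phi_jm
  mean((X - Z %*% t(Phi))^2)
})
recon.err
```

The error should decrease as M grows and be zero up to rounding at M = 4, 
since here $\min(n-1, p) = \min(49, 4) = 4$. 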

10.2.3 More on PCA 

Scaling the Variables 

We have already mentioned that before PCA is performed, the variables 
should be centered to have mean zero. Furthermore, the results obtained 
when we perform PCA will also depend on whether the variables have been 
individually scaled (each multiplied by a different constant). This is in 
contrast to some other supervised and unsupervised learning techniques, 
such as linear regression, in which scaling the variables has no effect. (In 
linear regression, multiplying a variable by a factor of c will simply lead to 
multiplication of the corresponding coefficient estimate by a factor of 1/c, 
and thus will have no substantive effect on the model obtained.) 
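
The claim about linear regression can be verified directly. The sketch 
below uses the built-in mtcars data purely as an example; the scaling factor 
c.scale and the variable name wt.scaled are introduced here for 
illustration. 

```r
c.scale <- 1000
fit1 <- lm(mpg ~ wt, data = mtcars)

# Refit after multiplying the predictor by c.scale
mtcars2 <- transform(mtcars, wt.scaled = c.scale * wt)
fit2 <- lm(mpg ~ wt.scaled, data = mtcars2)

# The coefficient shrinks by a factor of 1 / c.scale ...
coef(fit1)["wt"] / coef(fit2)["wt.scaled"]

# ... and the fitted values are unchanged
all.equal(unname(fitted(fit1)), unname(fitted(fit2)))
```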

FIGURE 10.3. Two principal component biplots for the USArrests data. Left: 
the same as Figure 10.1, with the variables scaled to have unit standard 
deviations. Right: principal components using unscaled data. Assault has by 
far the largest loading on the first principal component because it has the 
highest variance among the four variables. In general, scaling the variables 
to have standard deviation one is recommended. 

For instance, Figure 10.1 was obtained after scaling each of the variables 
to have standard deviation one. This is reproduced in the left-hand plot in 
Figure 10.3. Why does it matter that we scaled the variables? In these data, 
the variables are measured in different units; Murder, Rape, and Assault are 
reported as the number of occurrences per 100,000 people, and UrbanPop 
is the percentage of the state’s population that lives in an urban area. 
These four variables have variance 18.97, 87.73, 6945.16, and 209.5, 
respectively. Consequently, if we perform PCA on the unscaled variables, then 
the first principal component loading vector will have a very large loading 
for Assault, since that variable has by far the highest variance. The 
right-hand plot in Figure 10.3 displays the first two principal components 
for the USArrests data set, without scaling the variables to have standard 
deviation one. As predicted, the first principal component loading vector 
places almost all of its weight on Assault, while the second principal 
component loading vector places almost all of its weight on UrbanPop. 
Comparing this to the left-hand plot, we see that scaling does indeed have a 
substantial effect on the results obtained. 
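
The comparison described above can be reproduced, at least approximately, 
with prcomp(). This is a sketch rather than the code used for Figure 10.3, 
and the signs of the loadings may differ from those in the figure. 

```r
# Variances of the raw variables (Assault dominates)
apply(USArrests, 2, var)

pr.unscaled <- prcomp(USArrests, scale. = FALSE)
pr.scaled   <- prcomp(USArrests, scale. = TRUE)

# First two loading vectors without and with scaling
round(pr.unscaled$rotation[, 1:2], 3)
round(pr.scaled$rotation[, 1:2], 3)
```

In the unscaled fit, the first column should be dominated by Assault and 
the second by UrbanPop, whereas the scaled fit spreads the weight as in 
Table 10.1. 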

However, this result is simply a consequence of the scales on which the 
variables were measured. For instance, if Assault were measured in units 
of the number of occurrences per 100 people (rather than number of 
occurrences per 100,000 people), then this would amount to dividing all of 
the elements of that variable by 1,000. Then the variance of the variable 
would be tiny, and so the first principal component loading vector would 
have a very small value for that variable. Because it is undesirable for the 
principal components obtained to depend on an arbitrary choice of scaling, 
we typically scale each variable to have standard deviation one before we 
perform PCA. 

In certain settings, however, the variables may be measured in the same 
units. In this case, we might not wish to scale the variables to have 
standard deviation one before performing PCA. For instance, suppose that the 
variables in a given data set correspond to expression levels for p genes. 
Then since expression is measured in the same “units” for each gene, we 
might choose not to scale the genes to each have standard deviation one. 
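
The effect of a change of units on the variance can be checked directly. The following is a minimal sketch in Python (standard library only) rather than the R used in this book's labs; the values in `assault` are made up for illustration and are not the USArrests data, and `standardize` is a hypothetical helper name.

```python
import statistics

# Made-up values standing in for a variable reported per 100,000 people.
assault = [120.0, 250.0, 300.0, 80.0, 190.0, 210.0]

def standardize(values):
    """Center to mean zero and scale to sample standard deviation one."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Re-expressing the variable per 100 people divides every value by 1,000,
# so its variance shrinks by a factor of 1,000 squared.
assault_per_100 = [v / 1000.0 for v in assault]
var_original = statistics.variance(assault)
var_rescaled = statistics.variance(assault_per_100)

# After standardizing, the two versions coincide, so the arbitrary
# choice of units can no longer affect a subsequent PCA.
z_original = standardize(assault)
z_rescaled = standardize(assault_per_100)
```

Because the standardized versions agree, any analysis run on them gives the same answer whichever units the variable was originally recorded in.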

Uniqueness of the Principal Components 

Each principal component loading vector is unique, up to a sign flip. This 
means that two different software packages will yield the same principal 
component loading vectors, although the signs of those loading vectors 
may differ. The signs may differ because each principal component loading 
vector specifies a direction in p-dimensional space: flipping the sign has no 
effect as the direction does not change. (Consider Figure 6.14: the principal 
component loading vector is a line that extends in either direction, and 
flipping its sign would have no effect.) Similarly, the score vectors are unique 
up to a sign flip, since the variance of $Z$ is the same as the variance of $-Z$. 
It is worth noting that when we use (10.5) to approximate $x_{ij}$ we multiply 
$z_{im}$ by $\phi_{jm}$. Hence, if the sign is flipped on both the loading and score 
vectors, the final product of the two quantities is unchanged. 
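
The sign argument can be illustrated with a few lines of Python. This is only a sketch: the loading vector `phi` and the centered observation `x` are hypothetical numbers chosen for the example, and `score` is an illustrative helper.

```python
phi = [0.6, 0.8]    # hypothetical unit-length loading vector
x = [1.5, -0.5]     # one hypothetical centered observation

def score(loading, obs):
    """Principal component score: the sum over j of phi_j * x_j."""
    return sum(l * v for l, v in zip(loading, obs))

z = score(phi, x)

# Flip the sign of the loading vector; the score flips sign as well.
phi_flipped = [-l for l in phi]
z_flipped = score(phi_flipped, x)

# The products z * phi_j used in the approximation are unchanged.
approx = [z * l for l in phi]
approx_flipped = [z_flipped * l for l in phi_flipped]
```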

The Proportion of Variance Explained 

In Figure 10.2, we performed PCA on a three-dimensional data set (left- 
hand panel) and projected the data onto the first two principal component 
loading vectors in order to obtain a two-dimensional view of the data (i.e. 
the principal component score vectors; right-hand panel). We see that this 
two-dimensional representation of the three-dimensional data does 
successfully capture the major pattern in the data: the orange, green, and cyan 
observations that are near each other in three-dimensional space remain 
nearby in the two-dimensional representation. Similarly, we have seen on 
the USArrests data set that we can summarize the 50 observations and 4 
variables using just the first two principal component score vectors and the 
first two principal component loading vectors. 

We can now ask a natural question: how much of the information in 
a given data set is lost by projecting the observations onto the first few 
principal components? That is, how much of the variance in the data is not 
contained in the first few principal components? More generally, we are 
interested in knowing the proportion of variance explained (PVE) by each 
principal component. The total variance present in a data set (assuming 
that the variables have been centered to have mean zero) is defined as 

$$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2, \tag{10.6}$$

and the variance explained by the $m$th principal component is 

$$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2. \tag{10.7}$$

Therefore, the PVE of the $m$th principal component is given by 

$$\frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}. \tag{10.8}$$

FIGURE 10.4. Left: a scree plot depicting the proportion of variance explained 
by each of the four principal components in the USArrests data. Right: the 
cumulative proportion of variance explained by the four principal components in 
the USArrests data. 


The PVE of each principal component is a positive quantity. In order to 
compute the cumulative PVE of the first M principal components, we 
can simply sum (10.8) over each of the first M PVEs. In total, there are 
$\min(n-1, p)$ principal components, and their PVEs sum to one. 

In the USArrests data, the first principal component explains 62.0% of 
the variance in the data, and the next principal component explains 24.7% 
of the variance. Together, the first two principal components explain almost 
87% of the variance in the data, and the last two principal components 
explain only 13% of the variance. This means that Figure 10.1 provides a 
pretty accurate summary of the data using just two dimensions. The PVE 
of each principal component, as well as the cumulative PVE, is shown 
in Figure 10.4. The left-hand panel is known as a scree plot, and will be 
discussed next. 
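
To make the PVE formula concrete, here is a minimal Python sketch (standard library only, not the R used in the labs). It is restricted to p = 2 variables so that the loading vectors can be obtained in closed form from the 2 x 2 covariance matrix; the data in `X` are made up, and a real analysis would use a general eigen-solver instead.

```python
import math

# Made-up two-variable data set (rows are observations).
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
     [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]]
n, p = len(X), len(X[0])

# Center each column to mean zero, as the total-variance formula assumes.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]

# Entries of the 2 x 2 covariance matrix [[a, b], [b, c]].
a = sum(r[0] * r[0] for r in Xc) / n
b = sum(r[0] * r[1] for r in Xc) / n
c = sum(r[1] * r[1] for r in Xc) / n

# Closed-form principal axes of a symmetric 2 x 2 matrix: the angle theta
# maximizes the variance of the projection onto (cos theta, sin theta).
theta = 0.5 * math.atan2(2.0 * b, a - c)
loadings = [(math.cos(theta), math.sin(theta)),     # first loading vector
            (-math.sin(theta), math.cos(theta))]    # second loading vector

# Denominator of the PVE: sum of all squared centered entries.
total = sum(v * v for row in Xc for v in row)

# Numerator of the PVE for each component: sum of squared scores.
pve = [sum((phi[0] * r[0] + phi[1] * r[1]) ** 2 for r in Xc) / total
       for phi in loadings]

# Cumulative PVE of the first M components, for M = 1, ..., p.
cumulative = [sum(pve[:m + 1]) for m in range(p)]
```

Plotting `pve` against the component index would give a scree plot, and plotting `cumulative` would give the cumulative curve.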


Deciding How Many Principal Components to Use 

In general, an $n \times p$ data matrix $\mathbf{X}$ has $\min(n-1, p)$ distinct principal 
components. However, we usually are not interested in all of them; rather, 
we would like to use just the first few principal components in order to 
visualize or interpret the data. In fact, we would like to use the smallest 
number of principal components required to get a good understanding of the 
data. How many principal components are needed? Unfortunately, there is 
no single (or simple!) answer to this question. 

We typically decide on the number of principal components required 
to visualize the data by examining a scree plot, such as the one shown 
in the left-hand panel of Figure 10.4. We choose the smallest number of 
principal components that are required in order to explain a sizable amount 
of the variation in the data. This is done by eyeballing the scree plot, and 
looking for a point at which the proportion of variance explained by each 
subsequent principal component drops off. This is often referred to as an 
elbow in the scree plot. For instance, by inspection of Figure 10.4, one 
might conclude that a fair amount of variance is explained by the first 
two principal components, and that there is an elbow after the second 
component. After all, the third principal component explains less than ten 
percent of the variance in the data, and the fourth principal component 
explains less than half that and so is essentially worthless. 

However, this type of visual analysis is inherently ad hoc. Unfortunately, 
there is no well-accepted objective way to decide how many principal 
components are enough. In fact, the question of how many principal 
components are enough is inherently ill-defined, and will depend on the specific 
area of application and the specific data set. In practice, we tend to look 
at the first few principal components in order to find interesting patterns 
in the data. If no interesting patterns are found in the first few principal 
components, then further principal components are unlikely to be of 
interest. Conversely, if the first few principal components are interesting, then 
we typically continue to look at subsequent principal components until no 
further interesting patterns are found. This is admittedly a subjective 
approach, and is reflective of the fact that PCA is generally used as a tool for 
exploratory data analysis. 

On the other hand, if we compute principal components for use in a 
supervised analysis, such as the principal components regression presented 
in Section 6.3.1, then there is a simple and objective way to determine how 
many principal components to use: we can treat the number of principal 
component score vectors to be used in the regression as a tuning parameter 
to be selected via cross-validation or a related approach. The comparative 
simplicity of selecting the number of principal components for a supervised 
analysis is one manifestation of the fact that supervised analyses tend to 
be more clearly defined and more objectively evaluated than unsupervised 
analyses. 


10.2.4 Other Uses for Principal Components 

We saw in Section 6.3.1 that we can perform regression using the principal 
component score vectors as features. In fact, many statistical techniques, 
such as regression, classification, and clustering, can be easily adapted to 
use the $n \times M$ matrix whose columns are the first $M \ll p$ principal 
component score vectors, rather than using the full $n \times p$ data matrix. This 
can lead to less noisy results, since it is often the case that the signal (as 
opposed to the noise) in a data set is concentrated in its first few principal 
components. 


10.3 Clustering Methods 

Clustering refers to a very broad set of techniques for finding subgroups, or 
clusters, in a data set. When we cluster the observations of a data set, we 
seek to partition them into distinct groups so that the observations within 
each group are quite similar to each other, while observations in different 
groups are quite different from each other. Of course, to make this concrete, 
we must define what it means for two or more observations to be similar 
or different. Indeed, this is often a domain-specific consideration that must 
be made based on knowledge of the data being studied. 

For instance, suppose that we have a set of n observations, each with p 
features. The n observations could correspond to tissue samples for patients 
with breast cancer, and the p features could correspond to measurements 
collected for each tissue sample; these could be clinical measurements, such 
as tumor stage or grade, or they could be gene expression measurements. 
We may have a reason to believe that there is some heterogeneity among 
the n tissue samples; for instance, perhaps there are a few different 
unknown subtypes of breast cancer. Clustering could be used to find these 
subgroups. This is an unsupervised problem because we are trying to 
discover structure (in this case, distinct clusters) on the basis of a data set. 
The goal in supervised problems, on the other hand, is to try to predict 
some outcome vector such as survival time or response to drug treatment. 

Both clustering and PCA seek to simplify the data via a small number 
of summaries, but their mechanisms are different: 

• PCA looks to find a low-dimensional representation of the 
observations that explains a good fraction of the variance; 

• Clustering looks to find homogeneous subgroups among the 
observations. 

Another application of clustering arises in marketing. We may have 
access to a large number of measurements (e.g. median household income, 
occupation, distance from nearest urban area, and so forth) for a large 
number of people. Our goal is to perform market segmentation by 
identifying subgroups of people who might be more receptive to a particular form 
of advertising, or more likely to purchase a particular product. The task of 
performing market segmentation amounts to clustering the people in the 
data set. 

Since clustering is popular in many fields, there exist a great number of 
clustering methods. In this section we focus on perhaps the two best-known 
clustering approaches: K-means clustering and hierarchical clustering. In 
K-means clustering, we seek to partition the observations into a pre-specified 
number of clusters. On the other hand, in hierarchical clustering, we do 
not know in advance how many clusters we want; in fact, we end up with 
a tree-like visual representation of the observations, called a dendrogram, 
that allows us to view at once the clusterings obtained for each possible 
number of clusters, from 1 to n. There are advantages and disadvantages 
to each of these clustering approaches, which we highlight in this chapter. 

In general, we can cluster observations on the basis of the features in 
order to identify subgroups among the observations, or we can cluster 
features on the basis of the observations in order to discover subgroups among 
the features. In what follows, for simplicity we will discuss clustering 
observations on the basis of the features, though the converse can be performed 
by simply transposing the data matrix. 


10.3.1 K-Means Clustering 

K-means clustering is a simple and elegant approach for partitioning a 
data set into K distinct, non-overlapping clusters. To perform K-means 
clustering, we must first specify the desired number of clusters K; then the 
K-means algorithm will assign each observation to exactly one of the K 
clusters. Figure 10.5 shows the results obtained from performing K-means 
clustering on a simulated example consisting of 150 observations in two 
dimensions, using three different values of K. 

The K-means clustering procedure results from a simple and intuitive 
mathematical problem. We begin by defining some notation. Let $C_1, \ldots, C_K$ 
denote sets containing the indices of the observations in each cluster. These 
sets satisfy two properties: 

1. $C_1 \cup C_2 \cup \ldots \cup C_K = \{1, \ldots, n\}$. In other words, each 
observation belongs to at least one of the K clusters. 

2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are 
non-overlapping: no observation belongs to more than one cluster. 

For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$. The 
idea behind K-means clustering is that a good clustering is one for which the 
within-cluster variation is as small as possible. The within-cluster variation 
for cluster $C_k$ is a measure $W(C_k)$ of the amount by which the observations 
within a cluster differ from each other. Hence we want to solve the problem 

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} W(C_k) \right\}. \tag{10.9}$$

In words, this formula says that we want to partition the observations into 
K clusters such that the total within-cluster variation, summed over all K 
clusters, is as small as possible. 

FIGURE 10.5. A simulated data set with 150 observations in two-dimensional 
space. Panels show the results of applying K-means clustering with different 
values of K, the number of clusters. The color of each observation indicates the 
cluster to which it was assigned using the K-means clustering algorithm. Note that 
there is no ordering of the clusters, so the cluster coloring is arbitrary. These 
cluster labels were not used in clustering; instead, they are the outputs of the 
clustering procedure. 

Solving (10.9) seems like a reasonable idea, but in order to make it 
actionable we need to define the within-cluster variation. There are many 
possible ways to define this concept, but by far the most common choice 
involves squared Euclidean distance. That is, we define 

$$W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2, \tag{10.10}$$

where $|C_k|$ denotes the number of observations in the $k$th cluster. In other 
words, the within-cluster variation for the $k$th cluster is the sum of all of 
the pairwise squared Euclidean distances between the observations in the 
$k$th cluster, divided by the total number of observations in the $k$th cluster. 
Combining (10.9) and (10.10) gives the optimization problem that defines 
K-means clustering, 

$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}. \tag{10.11}$$









Now, we would like to find an algorithm to solve (10.11), that is, a 
method to partition the observations into K clusters such that the objective 
of (10.11) is minimized. This is in fact a very difficult problem to solve 
precisely, since there are almost $K^n$ ways to partition n observations into K 
clusters. This is a huge number unless K and n are tiny! Fortunately, a very 
simple algorithm can be shown to provide a local optimum (a pretty good 
solution) to the K-means optimization problem (10.11). This approach is 
laid out in Algorithm 10.1. 
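
To get a feel for how quickly the number of candidate partitions grows, the sketch below counts them exactly. It rests on one assumption not stated in the text: that "partition" means a split into K non-empty clusters whose labels are interchangeable, in which case the count is a Stirling number of the second kind (roughly $K^n / K!$ for large n). The function name `stirling2` is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Ways to split n labeled items into k non-empty, unlabeled groups."""
    if n == k:
        return 1          # every item alone (also covers n = k = 0)
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing groups
    # or forms a new group by itself.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

# Even modest problems have far too many partitions to enumerate.
count_small = stirling2(10, 3)
count_large = stirling2(150, 3)
```

For ten observations and three clusters the count is already in the thousands, and for the 150 observations of Figure 10.5 it has about seventy digits, which is why an exhaustive search is not an option.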


Algorithm 10.1 K-Means Clustering 

1. Randomly assign a number, from 1 to K, to each of the observations. 
These serve as initial cluster assignments for the observations. 

2. Iterate until the cluster assignments stop changing: 

(a) For each of the K clusters, compute the cluster centroid. The 
$k$th cluster centroid is the vector of the p feature means for the 
observations in the $k$th cluster. 

(b) Assign each observation to the cluster whose centroid is closest 
(where closest is defined using Euclidean distance). 
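
The steps of Algorithm 10.1 can be sketched in plain Python as follows. This is an illustrative implementation, not the one used by any particular package: the names `k_means`, `sq_dist` and `centroid` are hypothetical, clusters are labeled 0 to K-1 rather than 1 to K, and the handling of an empty cluster (re-seeding its centroid with a randomly chosen observation) is an assumption, since the algorithm as stated does not cover that case.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Vector of feature means for a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(X, K, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign a cluster label to each observation.
    labels = [rng.randrange(K) for _ in X]
    centroids = []
    for _ in range(max_iter):
        # Step 2(a): compute the centroid of each cluster.
        centroids = []
        for k in range(K):
            members = [x for x, lab in zip(X, labels) if lab == k]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members
                             else list(rng.choice(X)))
        # Step 2(b): assign each observation to the closest centroid.
        new_labels = [min(range(K), key=lambda k: sq_dist(x, centroids[k]))
                      for x in X]
        if new_labels == labels:      # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Two well-separated groups of made-up points.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(X, K=2)
```

Comparing squared distances in Step 2(b) gives the same assignment as comparing Euclidean distances, because the square root is monotone.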


Algorithm 10.1 is guaranteed to decrease the value of the objective 
(10.11) at each step. To understand why, the following identity is 
illuminating: 

$$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \tag{10.12}$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$. 
In Step 2(a) the cluster means for each feature are the constants that 
minimize the sum-of-squared deviations, and in Step 2(b), reallocating the 
observations can only improve (10.12). This means that as the algorithm 
is run, the clustering obtained will continually improve until the result no 
longer changes; the objective of (10.11) will never increase. When the result 
no longer changes, a local optimum has been reached. Figure 10.6 shows 
the progression of the algorithm on the toy example from Figure 10.5. 
K-means clustering derives its name from the fact that in Step 2(a), the 
cluster centroids are computed as the mean of the observations assigned to 
each cluster. 
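
Identity (10.12) is easy to check numerically. The short Python sketch below does so for one made-up cluster of four observations; note that the double sum on the left-hand side runs over all ordered pairs, so each unordered pair is counted twice.

```python
# Made-up cluster of four observations with p = 2 features.
cluster = [[1.0, 2.0], [2.0, 0.0], [4.0, 3.0], [3.0, 5.0]]
size, p = len(cluster), len(cluster[0])

# Left-hand side of (10.12): squared distances over all ordered pairs
# of observations in the cluster, divided by the cluster size.
lhs = sum((xi[j] - xk[j]) ** 2
          for xi in cluster for xk in cluster for j in range(p)) / size

# Right-hand side: twice the sum of squared deviations from the
# cluster mean of each feature.
means = [sum(x[j] for x in cluster) / size for j in range(p)]
rhs = 2 * sum((x[j] - means[j]) ** 2 for x in cluster for j in range(p))
```

Both sides come to 36 for these values, which is what allows the K-means objective to be evaluated from centroids alone rather than from all pairwise distances.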

FIGURE 10.6. The progress of the K-means algorithm on the example of 
Figure 10.5 with K=3. Top left: the observations are shown. Top center: in Step 1 
of the algorithm, each observation is randomly assigned to a cluster. Top right: 
in Step 2(a), the cluster centroids are computed. These are shown as large 
colored disks. Initially the centroids are almost completely overlapping because the 
initial cluster assignments were chosen at random. Bottom left: in Step 2(b), 
each observation is assigned to the nearest centroid. Bottom center: Step 2(a) is 
once again performed, leading to new cluster centroids. Bottom right: the results 
obtained after ten iterations. 

Because the K-means algorithm finds a local rather than a global 
optimum, the results obtained will depend on the initial (random) cluster 
assignment of each observation in Step 1 of Algorithm 10.1. For this reason, 
it is important to run the algorithm multiple times from different random 
initial configurations. Then one selects the best solution, i.e. that for which 
the objective (10.11) is smallest. Figure 10.7 shows the local optima 
obtained by running K-means clustering six times using six different initial 
cluster assignments, using the toy data from Figure 10.5. In this case, the 
best clustering is the one with an objective value of 235.8. 
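
The practice of running K-means from several random starts and keeping the best result might be sketched as follows. The function names (`k_means_once`, `objective`, `best_of`) are hypothetical, the empty-cluster rule is again an assumption, and the objective is evaluated through the right-hand side of (10.12).

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_point(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means_once(X, K, rng, max_iter=100):
    """One run of Algorithm 10.1 from a random initial assignment."""
    labels = [rng.randrange(K) for _ in X]
    for _ in range(max_iter):
        centroids = []
        for k in range(K):
            members = [x for x, lab in zip(X, labels) if lab == k]
            # Assumption: re-seed an empty cluster with a random point.
            centroids.append(mean_point(members) if members
                             else list(rng.choice(X)))
        new_labels = [min(range(K), key=lambda k: sq_dist(x, centroids[k]))
                      for x in X]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def objective(X, labels, K):
    """Objective (10.11), computed via the right-hand side of (10.12)."""
    total = 0.0
    for k in range(K):
        members = [x for x, lab in zip(X, labels) if lab == k]
        if members:
            center = mean_point(members)
            total += 2 * sum(sq_dist(x, center) for x in members)
    return total

def best_of(X, K, n_starts=20, seed=0):
    """Run K-means n_starts times and keep the smallest objective."""
    rng = random.Random(seed)
    runs = [k_means_once(X, K, rng) for _ in range(n_starts)]
    values = [objective(X, labels, K) for labels in runs]
    best = min(range(n_starts), key=lambda r: values[r])
    return runs[best], values[best], values

# Made-up data: two tight, well-separated groups.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, value, values = best_of(X, K=2, n_starts=5, seed=1)
```

On data with less clear structure, the entries of `values` would typically differ from run to run, which is exactly the situation shown in Figure 10.7.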

As we have seen, to perform K-means clustering, we must decide how 
many clusters we expect in the data. The problem of selecting K is far from 
simple. This issue, along with other practical considerations that arise in 
performing K-means clustering, is addressed in Section 10.3.3. 

FIGURE 10.7. K-means clustering performed six times on the data from 
Figure 10.5 with K = 3, each time with a different random assignment of the 
observations in Step 1 of the K-means algorithm. Above each plot is the value of 
the objective (10.11). Three different local optima were obtained, one of which 
resulted in a smaller value of the objective and provides better separation between 
the clusters. Those labeled in red all achieved the same best solution, with an 
objective value of 235.8. 

10.3.2 Hierarchical Clustering 

One potential disadvantage of K-means clustering is that it requires us to 
pre-specify the number of clusters K. Hierarchical clustering is an 
alternative approach which does not require that we commit to a particular 
choice of K. Hierarchical clustering has an added advantage over K-means 
clustering in that it results in an attractive tree-based representation of the 
observations, called a dendrogram. 

In this section, we describe bottom-up or agglomerative clustering. 
This is the most common type of hierarchical clustering, and refers to 
the fact that a dendrogram (generally depicted as an upside-down tree; see 
Figure 10.9) is built starting from the leaves and combining clusters up to 
the trunk. We will begin with a discussion of how to interpret a dendrogram 
and then discuss how hierarchical clustering is actually performed, that is, 
how the dendrogram is built. 

FIGURE 10.8. Forty-five observations generated in two-dimensional space. In 
reality there are three distinct classes, shown in separate colors. However, we will 
treat these class labels as unknown and will seek to cluster the observations in 
order to discover the classes from the data. 

Interpreting a Dendrogram 

We begin with the simulated data set shown in Figure 10.8, consisting of 
45 observations in two-dimensional space. The data were generated from a 
three-class model; the true class labels for each observation are shown in 
distinct colors. However, suppose that the data were observed without the 
class labels, and that we wanted to perform hierarchical clustering of the 
data. Hierarchical clustering (with complete linkage, to be discussed later) 
yields the result shown in the left-hand panel of Figure 10.9. How can we 
interpret this dendrogram? 

In the left-hand panel of Figure 10.9, each leaf of the dendrogram 
represents one of the 45 observations in Figure 10.8. However, as we move 
up the tree, some leaves begin to fuse into branches. These correspond to 
observations that are similar to each other. As we move higher up the tree, 
branches themselves fuse, either with leaves or other branches. The earlier 
(lower in the tree) fusions occur, the more similar the groups of 
observations are to each other. On the other hand, observations that fuse later 
(near the top of the tree) can be quite different. In fact, this statement 
can be made precise: for any two observations, we can look for the point in 
the tree where branches containing those two observations are first fused. 
The height of this fusion, as measured on the vertical axis, indicates how 
different the two observations are. Thus, observations that fuse at the very 
bottom of the tree are quite similar to each other, whereas observations 
that fuse close to the top of the tree will tend to be quite different. 

FIGURE 10.9. Left: dendrogram obtained from hierarchically clustering the data 
from Figure 10.8 with complete linkage and Euclidean distance. Center: the 
dendrogram from the left-hand panel, cut at a height of nine (indicated by the dashed 
line). This cut results in two distinct clusters, shown in different colors. Right: 
the dendrogram from the left-hand panel, now cut at a height of five. This cut 
results in three distinct clusters, shown in different colors. Note that the colors 
were not used in clustering, but are simply used for display purposes in this figure. 

This highlights a very important point in interpreting dendrograms that 
is often misunderstood. Consider the left-hand panel of Figure 10.10, which 
shows a simple dendrogram obtained from hierarchically clustering nine 
observations. One can see that observations 5 and 7 are quite similar to 
each other, since they fuse at the lowest point on the dendrogram. 
Observations 1 and 6 are also quite similar to each other. However, it is tempting 
but incorrect to conclude from the figure that observations 9 and 2 are 
quite similar to each other on the basis that they are located near each 
other on the dendrogram. In fact, based on the information contained in 
the dendrogram, observation 9 is no more similar to observation 2 than it 
is to observations 8, 5, and 7. (This can be seen from the right-hand panel 
of Figure 10.10, in which the raw data are displayed.) To put it 
mathematically, there are $2^{n-1}$ possible reorderings of the dendrogram, where n 
is the number of leaves. This is because at each of the $n-1$ points where 
fusions occur, the positions of the two fused branches could be swapped 
without affecting the meaning of the dendrogram. Therefore, we cannot 
draw conclusions about the similarity of two observations based on their 
proximity along the horizontal axis. Rather, we draw conclusions about 
the similarity of two observations based on the location on the vertical axis 
where branches containing those two observations first are fused. 

FIGURE 10.10. An illustration of how to properly interpret a dendrogram with 
nine observations in two-dimensional space. Left: a dendrogram generated using 
Euclidean distance and complete linkage. Observations 5 and 7 are quite similar 
to each other, as are observations 1 and 6. However, observation 9 is no more 
similar to observation 2 than it is to observations 8, 5, and 7, even though 
observations 9 and 2 are close together in terms of horizontal distance. This is because 
observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, 
approximately 1.8. Right: the raw data used to generate the dendrogram can be used to 
confirm that indeed, observation 9 is no more similar to observation 2 than it is 
to observations 8, 5, and 7. 


Now that we understand how to interpret the left-hand panel of 
Figure 10.9, we can move on to the issue of identifying clusters on the basis 
of a dendrogram. In order to do this, we make a horizontal cut across the 
dendrogram, as shown in the center and right-hand panels of Figure 10.9. 
The distinct sets of observations beneath the cut can be interpreted as 
clusters. In the center panel of Figure 10.9, cutting the dendrogram at a height 
of nine results in two clusters, shown in distinct colors. In the right-hand 
panel, cutting the dendrogram at a height of five results in three clusters. 
Further cuts can be made as one descends the dendrogram in order to 
obtain any number of clusters, between 1 (corresponding to no cut) and n 
(corresponding to a cut at height 0, so that each observation is in its own 
cluster). In other words, the height of the cut to the dendrogram serves 
the same role as the K in K-means clustering: it controls the number of 
clusters obtained. 

Figure 10.9 therefore highlights a very attractive aspect of hierarchical 
clustering: one single dendrogram can be used to obtain any number of 
clusters. In practice, people often look at the dendrogram and select by eye 
a sensible number of clusters, based on the heights of the fusion and the 
number of clusters desired. In the case of Figure 10.9, one might choose to 
select either two or three clusters. However, often the choice of where to 
cut the dendrogram is not so clear. 

The term hierarchical refers to the fact that clusters obtained by cutting 
the dendrogram at a given height are necessarily nested within the clusters 
obtained by cutting the dendrogram at any greater height. However, on 
an arbitrary data set, this assumption of hierarchical structure might be 
unrealistic. For instance, suppose that our observations correspond to a 
group of people with a 50-50 split of males and females, evenly split among 
Americans, Japanese, and French. We can imagine a scenario in which the 
best division into two groups might split these people by gender, and the 
best division into three groups might split them by nationality. In this case, 
the true clusters are not nested, in the sense that the best division into three 
groups does not result from taking the best division into two groups and 
splitting up one of those groups. Consequently, this situation could not be 
well-represented by hierarchical clustering. Due to situations such as this 
one, hierarchical clustering can sometimes yield worse (i.e. less accurate) 
results than K-means clustering for a given number of clusters. 

The Hierarchical Clustering Algorithm 

The hierarchical clustering dendrogram is obtained via an extremely simple 
algorithm. We begin by defining some sort of dissimilarity measure between 
each pair of observations. Most often, Euclidean distance is used; we will 
discuss the choice of dissimilarity measure later in this chapter. The 
algorithm proceeds iteratively. Starting out at the bottom of the dendrogram, 
each of the n observations is treated as its own cluster. The two clusters 
that are most similar to each other are then fused so that there now are 
$n-1$ clusters. Next the two clusters that are most similar to each other are 
fused again, so that there now are $n-2$ clusters. The algorithm proceeds 
in this fashion until all of the observations belong to one single cluster, and 
the dendrogram is complete. Figure 10.11 depicts the first few steps of the 
algorithm, for the data from Figure 10.10. To summarize, the hierarchical 
clustering algorithm is given in Algorithm 10.2. 


This algorithm seems simple enough, but one issue has not been ad¬ 
dressed. Consider the bottom right panel in Figure 10.11. How did we 
determine that the cluster {5,7} should be fused with the cluster {8}? 
We have a concept of the dissimilarity between pairs of observations, but 
how do we define the dissimilarity between two clusters if one or both of 
the clusters contains multiple observations? The concept of dissimilarity 
between a pair of observations needs to be extended to a pair of groups 
of observations. This extension is achieved by developing the notion of 
linkage , which defines the dissimilarity between two groups of observa¬ 
tions. The four most common types of linkage— complete , average , single, 
and centroid —are briefly described in Table 10.2. Average, complete, and 
single linkage are most popular among statisticians. Average and complete
linkage are generally preferred over single linkage, as they tend to yield
more balanced dendrograms. Centroid linkage is often used in genomics,
but suffers from a major drawback in that an inversion can occur, whereby
two clusters are fused at a height below either of the individual clusters in
the dendrogram. This can lead to difficulties in visualization as well as in
interpretation of the dendrogram. The dissimilarities computed in Step 2(b)
of the hierarchical clustering algorithm will depend on the type of linkage
used, as well as on the choice of dissimilarity measure. Hence, the resulting
dendrogram typically depends quite strongly on the type of linkage used,
as is shown in Figure 10.12.

Algorithm 10.2 Hierarchical Clustering

1. Begin with n observations and a measure (such as Euclidean distance)
   of all the $\binom{n}{2} = n(n-1)/2$ pairwise dissimilarities. Treat each
   observation as its own cluster.

2. For i = n, n − 1, ..., 2:

   (a) Examine all pairwise inter-cluster dissimilarities among the i
       clusters and identify the pair of clusters that are least dissimilar
       (that is, most similar). Fuse these two clusters. The dissimilarity
       between these two clusters indicates the height in the dendrogram
       at which the fusion should be placed.

   (b) Compute the new pairwise inter-cluster dissimilarities among
       the i − 1 remaining clusters.


Linkage     Description

Complete    Maximal intercluster dissimilarity. Compute all pairwise
            dissimilarities between the observations in cluster A and the
            observations in cluster B, and record the largest of these
            dissimilarities.

Single      Minimal intercluster dissimilarity. Compute all pairwise
            dissimilarities between the observations in cluster A and the
            observations in cluster B, and record the smallest of these
            dissimilarities. Single linkage can result in extended, trailing
            clusters in which single observations are fused one-at-a-time.

Average     Mean intercluster dissimilarity. Compute all pairwise
            dissimilarities between the observations in cluster A and the
            observations in cluster B, and record the average of these
            dissimilarities.

Centroid    Dissimilarity between the centroid for cluster A (a mean
            vector of length p) and the centroid for cluster B. Centroid
            linkage can result in undesirable inversions.

TABLE 10.2. A summary of the four most commonly-used types of linkage in
hierarchical clustering.


FIGURE 10.11. An illustration of the first few steps of the hierarchical
clustering algorithm, using the data from Figure 10.10, with complete linkage
and Euclidean distance. Top Left: initially, there are nine distinct clusters,
{1}, {2}, ..., {9}. Top Right: the two clusters that are closest together, {5} and
{7}, are fused into a single cluster. Bottom Left: the two clusters that are closest
together, {6} and {1}, are fused into a single cluster. Bottom Right: the two
clusters that are closest together using complete linkage, {8} and the cluster
{5, 7}, are fused into a single cluster.
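The two steps of Algorithm 10.2, together with the linkage definitions in Table 10.2, can be sketched in a few lines of R. This is a deliberately naive illustration, not how hclust() is implemented; the names naive_hclust and linkage_fun are hypothetical, and the data are simulated. Passing max, min, or mean as linkage_fun gives complete, single, or average linkage respectively.

```r
# Naive agglomerative clustering following Algorithm 10.2.
# naive_hclust and linkage_fun are illustrative names, not part of R.
naive_hclust <- function(x, linkage_fun = max) {
  d <- as.matrix(dist(x))                # all pairwise Euclidean dissimilarities
  clusters <- as.list(seq_len(nrow(x)))  # Step 1: each observation is a cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {         # Step 2: i = n, n - 1, ..., 2
    best <- c(NA, NA)
    best_d <- Inf
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        # Linkage: summarize the dissimilarities between the two clusters
        dab <- linkage_fun(d[clusters[[a]], clusters[[b]]])
        if (dab < best_d) {
          best_d <- dab
          best <- c(a, b)
        }
      }
    }
    heights <- c(heights, best_d)        # height of this fusion in the dendrogram
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL          # Step 2(b): i - 1 clusters remain
  }
  heights
}

set.seed(1)
x <- matrix(rnorm(9 * 2), ncol = 2)      # nine simulated observations
h_complete <- naive_hclust(x, max)       # complete linkage
h_single <- naive_hclust(x, min)         # single linkage
h_average <- naive_hclust(x, mean)       # average linkage
```

The returned fusion heights can be compared with the height component of hclust(dist(x)) for the same linkage. Centroid linkage is omitted because it is not a simple summary of the pairwise dissimilarities.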

Choice of Dissimilarity Measure 

FIGURE 10.12. Average, complete, and single linkage applied to an example
data set. Average and complete linkage tend to yield more balanced clusters.

Thus far, the examples in this chapter have used Euclidean distance as the
dissimilarity measure. But sometimes other dissimilarity measures might
be preferred. For example, correlation-based distance considers two
observations to be similar if their features are highly correlated, even though
the observed values may be far apart in terms of Euclidean distance. This is
an unusual use of correlation, which is normally computed between
variables; here it is computed between the observation profiles for each pair
of observations. Figure 10.13 illustrates the difference between Euclidean
and correlation-based distance. Correlation-based distance focuses on the
shapes of observation profiles rather than their magnitudes.
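The contrast between the two measures can be reproduced with a small made-up example. In this sketch the three profiles (obs1, obs2, obs3) are hypothetical and chosen only for illustration: obs2 has the same shape as obs1 but is shifted far away, while obs3 is close in value to obs1 but has a different shape.

```r
# Three hypothetical observation profiles measured on p = 20 variables.
p <- 20
theta <- seq(0, 2 * pi, length.out = p)
obs1 <- sin(theta)           # reference profile
obs2 <- sin(theta) + 5       # same shape as obs1, much larger values
obs3 <- 0.5 * cos(theta)     # values close to obs1, different shape

x <- rbind(obs1, obs2, obs3)

# Euclidean distance between the rows (observations)
euclid <- as.matrix(dist(x))

# Correlation-based distance: one minus the correlation between
# observation profiles, hence the transpose before calling cor()
cor_dist <- as.matrix(as.dist(1 - cor(t(x))))

round(euclid, 2)
round(cor_dist, 2)
```

Under Euclidean distance obs1 is closer to obs3 than to obs2, whereas under correlation-based distance the ordering is reversed.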

The choice of dissimilarity measure is very important, as it has a strong 
effect on the resulting dendrogram. In general, careful attention should be 
paid to the type of data being clustered and the scientific question at hand. 
These considerations should determine what type of dissimilarity measure 
is used for hierarchical clustering. 

FIGURE 10.13. Three observations with measurements on 20 variables are
shown. Observations 1 and 3 have similar values for each variable and so there
is a small Euclidean distance between them. But they are very weakly correlated,
so they have a large correlation-based distance. On the other hand, observations
1 and 2 have quite different values for each variable, and so there is a large
Euclidean distance between them. But they are highly correlated, so there is a
small correlation-based distance between them.

For instance, consider an online retailer interested in clustering shoppers
based on their past shopping histories. The goal is to identify subgroups
of similar shoppers, so that shoppers within each subgroup can be shown
items and advertisements that are particularly likely to interest them.
Suppose the data takes the form of a matrix where the rows are the shoppers
and the columns are the items available for purchase; the elements of the
data matrix indicate the number of times a given shopper has purchased a
given item (i.e. a 0 if the shopper has never purchased this item, a 1 if the
shopper has purchased it once, etc.). What type of dissimilarity measure
should be used to cluster the shoppers? If Euclidean distance is used, then
shoppers who have bought very few items overall (i.e. infrequent users of
the online shopping site) will be clustered together. This may not be
desirable. On the other hand, if correlation-based distance is used, then shoppers
with similar preferences (e.g. shoppers who have bought items A and B but
never items C or D) will be clustered together, even if some shoppers with
these preferences are higher-volume shoppers than others. Therefore, for
this application, correlation-based distance may be a better choice.

In addition to carefully selecting the dissimilarity measure used, one must
also consider whether or not the variables should be scaled to have
standard deviation one before the dissimilarity between the observations is
computed. To illustrate this point, we continue with the online shopping
example just described. Some items may be purchased more frequently than
others; for instance, a shopper might buy ten pairs of socks a year, but a
computer very rarely. High-frequency purchases like socks therefore tend
to have a much larger effect on the inter-shopper dissimilarities, and hence
on the clustering ultimately obtained, than rare purchases like computers.
This may not be desirable. If the variables are scaled to have standard
deviation one before the inter-observation dissimilarities are computed, then
each variable will in effect be given equal importance in the hierarchical
clustering performed. We might also want to scale the variables to have
standard deviation one if they are measured on different scales; otherwise,
the choice of units (e.g. centimeters versus kilometers) for a particular
variable will greatly affect the dissimilarity measure obtained. It should come
as no surprise that whether or not it is a good decision to scale the variables
before computing the dissimilarity measure depends on the application at
hand. An example is shown in Figure 10.14. We note that the issue of
whether or not to scale the variables before performing clustering applies
to K-means clustering as well.

FIGURE 10.14. An eclectic online retailer sells two items: socks and computers.
Left: the number of pairs of socks, and computers, purchased by eight online
shoppers is displayed. Each shopper is shown in a different color. If inter-observation
dissimilarities are computed using Euclidean distance on the raw variables, then
the number of socks purchased by an individual will drive the dissimilarities
obtained, and the number of computers purchased will have little effect. This might be
undesirable, since (1) computers are more expensive than socks and so the online
retailer may be more interested in encouraging shoppers to buy computers than
socks, and (2) a large difference in the number of socks purchased by two shoppers
may be less informative about the shoppers’ overall shopping preferences than a
small difference in the number of computers purchased. Center: the same data
is shown, after scaling each variable by its standard deviation. Now the number
of computers purchased will have a much greater effect on the inter-observation
dissimilarities obtained. Right: the same data are displayed, but now the y-axis
represents the number of dollars spent by each online shopper on socks and on
computers. Since computers are much more expensive than socks, now computer
purchase history will drive the inter-observation dissimilarities obtained.
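The effect of scaling in the socks-and-computers setting can be checked numerically. The purchase counts below are made up for illustration (they are not the data behind Figure 10.14), and the object names are hypothetical.

```r
# Hypothetical yearly purchase counts for eight shoppers (made-up numbers).
purchases <- cbind(socks     = c(8, 11, 7, 6, 5, 6, 7, 8),
                   computers = c(0,  0, 0, 0, 1, 1, 1, 1))

# Euclidean distances on the raw counts
d_raw <- as.matrix(dist(purchases))

# Euclidean distances after scaling each variable to standard deviation one
d_scaled <- as.matrix(dist(scale(purchases)))

# Shopper 1 versus shopper 2: same computers, three pairs of socks apart.
# Shopper 1 versus shopper 8: same socks, one computer apart.
c(raw_1_vs_2 = d_raw[1, 2], raw_1_vs_8 = d_raw[1, 8])
c(scaled_1_vs_2 = d_scaled[1, 2], scaled_1_vs_8 = d_scaled[1, 8])
```

On the raw counts, shopper 1 is closer to shopper 8 than to shopper 2, because one computer counts for less than three pairs of socks. After scaling, the ordering reverses: the difference in computer purchases now weighs more heavily.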


10.3.3 Practical Issues in Clustering 

Clustering can be a very useful tool for data analysis in the unsupervised 
setting. However, there are a number of issues that arise in performing 
clustering. We describe some of these issues here. 


Small Decisions with Big Consequences 

In order to perform clustering, some decisions must be made. 

• Should the observations or features first be standardized in some way?
For instance, maybe the variables should be centered to have mean
zero and scaled to have standard deviation one.

• In the case of hierarchical clustering,

— What dissimilarity measure should be used?

— What type of linkage should be used?

— Where should we cut the dendrogram in order to obtain clusters?

• In the case of K-means clustering, how many clusters should we look
for in the data?

Each of these decisions can have a strong impact on the results obtained. 
In practice, we try several different choices, and look for the one with 
the most useful or interpretable solution. With these methods, there is no 
single right answer—any solution that exposes some interesting aspects of 
the data should be considered. 

Validating the Clusters Obtained 

Any time clustering is performed on a data set we will find clusters. But we
really want to know whether the clusters that have been found represent
true subgroups in the data, or whether they are simply a result of clustering
the noise. For instance, if we were to obtain an independent set of
observations, then would those observations also display the same set of clusters?
This is a hard question to answer. There exist a number of techniques for
assigning a p-value to a cluster in order to assess whether there is more
evidence for the cluster than one would expect due to chance. However,
there has been no consensus on a single best approach. More details can
be found in Hastie et al. (2009).

Other Considerations in Clustering 

Both K-means and hierarchical clustering will assign each observation to
a cluster. However, sometimes this might not be appropriate. For instance,
suppose that most of the observations truly belong to a small number of
(unknown) subgroups, and a small subset of the observations are quite
different from each other and from all other observations. Then since
K-means and hierarchical clustering force every observation into a cluster, the
clusters found may be heavily distorted due to the presence of outliers that
do not belong to any cluster. Mixture models are an attractive approach
for accommodating the presence of such outliers. These amount to a soft
version of K-means clustering, and are described in Hastie et al. (2009).

In addition, clustering methods generally are not very robust to
perturbations to the data. For instance, suppose that we cluster n observations,
and then cluster the observations again after removing a subset of the n
observations at random. One would hope that the two sets of clusters
obtained would be quite similar, but often this is not the case!

A Tempered Approach to Interpreting the Results of Clustering

We have described some of the issues associated with clustering. However, 
clustering can be a very useful and valid statistical tool if used properly. We 
mentioned that small decisions in how clustering is performed, such as how 
the data are standardized and what type of linkage is used, can have a large 
effect on the results. Therefore, we recommend performing clustering with 
different choices of these parameters, and looking at the full set of results 
in order to see what patterns consistently emerge. Since clustering can be 
non-robust, we recommend clustering subsets of the data in order to get a 
sense of the robustness of the clusters obtained. Most importantly, we must 
be careful about how the results of a clustering analysis are reported. These 
results should not be taken as the absolute truth about a data set. Rather, 
they should constitute a starting point for the development of a scientific 
hypothesis and further study, preferably on an independent data set. 
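One way to act on the recommendation to cluster subsets of the data is sketched below. It is only an illustration under stated assumptions: the data are simulated, the 75% subsample size is arbitrary, and the agreement measure (the proportion of observation pairs treated the same way by both clusterings, often called the Rand index) is one choice among several.

```r
set.seed(1)
x <- matrix(rnorm(60 * 2), ncol = 2)
x[1:30, ] <- x[1:30, ] + 10     # two simulated, well-separated groups

# Cluster the full data set, then a random subset of the observations
full <- cutree(hclust(dist(x), method = "complete"), 2)
keep <- sort(sample(nrow(x), 45))
sub <- cutree(hclust(dist(x[keep, ]), method = "complete"), 2)

# Cross-tabulate the two sets of labels on the retained observations
agreement <- table(full = full[keep], subset = sub)

# Label-invariant agreement: do the two clusterings agree on whether each
# pair of observations belongs together?
same_full <- outer(full[keep], full[keep], "==")
same_sub <- outer(sub, sub, "==")
upper <- upper.tri(same_full)
rand <- mean(same_full[upper] == same_sub[upper])

agreement
rand
```

Repeating this over many random subsets, and over different linkage or standardization choices, gives a rough sense of which patterns emerge consistently.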


10.4 Lab 1: Principal Components Analysis 

In this lab, we perform PCA on the USArrests data set, which is part of 
the base R package. The rows of the data set contain the 50 states, in 
alphabetical order. 

> states=row.names(USArrests) 

> states 

The columns of the data set contain the four variables. 

> names(USArrests) 

[1] "Murder" "Assault" "UrbanPop" "Rape" 

We first briefly examine the data. We notice that the variables have vastly 
different means. 

> apply(USArrests , 2, mean) 

Murder Assault UrbanPop Rape 

7.79 170.76 65.54 21.23 

Note that the apply() function allows us to apply a function (in this case,
the mean() function) to each row or column of the data set. The second
input here denotes whether we wish to compute the mean of the rows, 1,
or the columns, 2. We see that there are on average three times as many
rapes as murders, and more than eight times as many assaults as rapes.
We can also examine the variances of the four variables using the apply()
function.

> apply(USArrests, 2, var)
  Murder  Assault UrbanPop     Rape
    19.0   6945.2    209.5     87.7

Not surprisingly, the variables also have vastly different variances: the
UrbanPop variable measures the percentage of the population in each state
living in an urban area, which is not a comparable number to the number
of rapes in each state per 100,000 individuals. If we failed to scale the
variables before performing PCA, then most of the principal components
that we observed would be driven by the Assault variable, since it has by
far the largest mean and variance. Thus, it is important to standardize the
variables to have mean zero and standard deviation one before performing
PCA.
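The claim about the Assault variable can be checked directly by comparing the first loading vector with and without scaling. This is a small sketch using the built-in USArrests data; the object names pr.unscaled and pr.scaled are illustrative.

```r
# PCA on the raw variables versus PCA on standardized variables
pr.unscaled <- prcomp(USArrests, scale = FALSE)
pr.scaled <- prcomp(USArrests, scale = TRUE)

# First principal component loading vector in each case
round(pr.unscaled$rotation[, 1], 3)
round(pr.scaled$rotation[, 1], 3)
```

Without scaling, the first loading vector places almost all of its weight on Assault; with scaling, the weight is spread much more evenly across the four variables.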

We now perform principal components analysis using the prcomp() function,
which is one of several functions in R that perform PCA.

> pr.out=prcomp(USArrests, scale=TRUE)

By default, the prcomp() function centers the variables to have mean zero.
By using the option scale=TRUE, we scale the variables to have standard
deviation one. The output from prcomp() contains a number of useful
quantities.

> names(pr.out) 

[1] "sdev" "rotation" "center" "scale" "x" 

The center and scale components correspond to the means and standard
deviations of the variables that were used for scaling prior to implementing
PCA.

> pr.out$center
  Murder  Assault UrbanPop     Rape
    7.79   170.76    65.54    21.23
> pr.out$scale
  Murder  Assault UrbanPop     Rape
    4.36    83.34    14.47     9.37


The rotation matrix provides the principal component loadings; each
column of pr.out$rotation contains the corresponding principal component
loading vector.²

> pr.out$rotation
            PC1    PC2    PC3    PC4
Murder   -0.536  0.418 -0.341  0.649
Assault  -0.583  0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378  0.134
Rape     -0.543 -0.167  0.818  0.089


We see that there are four distinct principal components. This is to be
expected because there are in general min(n − 1, p) informative principal
components in a data set with n observations and p variables.


² This function names it the rotation matrix, because when we matrix-multiply the
X matrix by pr.out$rotation, it gives us the coordinates of the data in the rotated
coordinate system. These coordinates are the principal component scores.


Using the prcomp() function, we do not need to explicitly multiply the
data by the principal component loading vectors in order to obtain the
principal component score vectors. Rather the 50 × 4 matrix x has as its
columns the principal component score vectors. That is, the kth column is
the kth principal component score vector.

> dim(pr.out$x)

[1] 50 4 
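The relationship between the data, the loadings, and the scores can be verified by hand. In this sketch the name scores is illustrative, and the fit is recomputed so the block runs on its own.

```r
pr.out <- prcomp(USArrests, scale = TRUE)

# Standardize the data, then multiply by the loading matrix
scores <- scale(USArrests) %*% pr.out$rotation

# Largest absolute difference from the scores stored by prcomp()
max(abs(scores - pr.out$x))
```

The difference should be zero up to floating-point rounding, confirming that pr.out$x holds the standardized data expressed in the rotated coordinate system.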

We can plot the first two principal components as follows:

> biplot(pr.out, scale=0)

The scale=0 argument to biplot() ensures that the arrows are scaled to
represent the loadings; other values for scale give slightly different biplots
with different interpretations.

Notice that this figure is a mirror image of Figure 10.1. Recall that 
the principal components are only unique up to a sign change, so we can 
reproduce Figure 10.1 by making a few small changes: 

> pr.out$rotation=-pr.out$rotation 

> pr.out$x=-pr.out$x 

> biplot(pr.out , scale=0) 

The prcomp() function also outputs the standard deviation of each
principal component. For instance, on the USArrests data set, we can access
these standard deviations as follows:

> pr.out$sdev
[1] 1.575 0.995 0.597 0.416

The variance explained by each principal component is obtained by
squaring these:

> pr.var=pr.out$sdev^2

> pr.var 

[1] 2.480 0.990 0.357 0.173 

To compute the proportion of variance explained by each principal
component, we simply divide the variance explained by each principal component
by the total variance explained by all four principal components:

> pve=pr.var/sum(pr.var) 

> pve 

[1] 0.6201 0.2474 0.0891 0.0434 

We see that the first principal component explains 62.0% of the variance
in the data, the next principal component explains 24.7% of the variance,
and so forth. We can plot the PVE explained by each component, as well
as the cumulative PVE, as follows:

> plot(pve, xlab="Principal Component",
    ylab="Proportion of Variance Explained", ylim=c(0,1), type="b")
> plot(cumsum(pve), xlab="Principal Component",
    ylab="Cumulative Proportion of Variance Explained",
    ylim=c(0,1), type="b")

The result is shown in Figure 10.4. Note that the function cumsum()
computes the cumulative sum of the elements of a numeric vector. For instance:

> a=c(1,2,8,-3) 

> cumsum(a) 

[1] 1 3 11 8 


10.5 Lab 2: Clustering 

10.5.1 K-Means Clustering 

The function kmeans() performs K-means clustering in R. We begin with
a simple simulated example in which there truly are two clusters in the
data: the first 25 observations have a mean shift relative to the next 25
observations.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> x[1:25,1]=x[1:25,1]+3
> x[1:25,2]=x[1:25,2]-4

We now perform K-means clustering with K = 2.

> km.out=kmeans(x,2,nstart=20)

The cluster assignments of the 50 observations are contained in
km.out$cluster.

> km.out$cluster
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The K-means clustering perfectly separated the observations into two
clusters even though we did not supply any group information to kmeans(). We
can plot the data, with each observation colored according to its cluster
assignment.

> plot(x, col=(km.out$cluster+1),
    main="K-Means Clustering Results with K=2",
    xlab="", ylab="", pch=20, cex=2)

Here the observations can be easily plotted because they are two-dimensional.
If there were more than two variables then we could instead perform PCA
and plot the first two principal component score vectors.
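The PCA-based plot just mentioned might look as follows for data with more than two variables. This is a sketch on simulated four-dimensional data; the names x4, km4, and pc4 are hypothetical.

```r
set.seed(2)
x4 <- matrix(rnorm(50 * 4), ncol = 4)
x4[1:25, ] <- x4[1:25, ] + 3       # simulated mean shift for the first group

# Cluster in the original four-dimensional space
km4 <- kmeans(x4, 2, nstart = 20)

# Project onto the first two principal components for plotting only
pc4 <- prcomp(x4)
plot(pc4$x[, 1:2], col = km4$cluster + 1, pch = 20, cex = 2,
     xlab = "Z1", ylab = "Z2",
     main = "K-Means Clusters on the First Two PCs")
```

Note that the clustering here is performed on all four variables; PCA is used only to obtain a two-dimensional view of the result.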

In this example, we knew that there really were two clusters because
we generated the data. However, for real data, in general we do not know
the true number of clusters. We could instead have performed K-means
clustering on this example with K = 3.

> set.seed(4)
> km.out=kmeans(x,3,nstart=20)
> km.out
K-means clustering with 3 clusters of sizes 10, 23, 17

Cluster means:
        [,1]        [,2]
1  2.3001545 -2.69622023
2 -0.3820397 -0.08740753
3  3.7789567 -4.56200798

Clustering vector:
 [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2
[30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2

Within cluster sum of squares by cluster:
[1] 19.56137 52.67700 25.74089
 (between_SS / total_SS =  79.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"

> plot(x, col=(km.out$cluster+1),
    main="K-Means Clustering Results with K=3",
    xlab="", ylab="", pch=20, cex=2)

When K = 3, K-means clustering splits up the two clusters.

To run the kmeans() function in R with multiple initial cluster
assignments, we use the nstart argument. If a value of nstart greater than one
is used, then K-means clustering will be performed using multiple random
assignments in Step 1 of Algorithm 10.1, and the kmeans() function will
report only the best results. Here we compare using nstart=1 to nstart=20.

> set.seed(3)
> km.out=kmeans(x,3,nstart=1)
> km.out$tot.withinss
[1] 104.3319
> km.out=kmeans(x,3,nstart=20)
> km.out$tot.withinss
[1] 97.9793
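The variability across single random starts can be made visible by repeating the nstart=1 fit many times. The sketch below regenerates simulated data of the same form so that it runs on its own; the names single_runs and km.multi are illustrative, and the exact numbers will depend on the seed and the R version.

```r
set.seed(3)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Twenty independent fits, each from a single random start
single_runs <- replicate(20, kmeans(x, 3, nstart = 1)$tot.withinss)

# One call that tries twenty starts and keeps the best
km.multi <- kmeans(x, 3, nstart = 20)

range(single_runs)        # spread of the objective across single starts
km.multi$tot.withinss     # objective for the best of twenty starts
```

If the single-start values differ from one another, some of those runs ended in local optima, which is the situation a larger nstart is meant to guard against.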

Note that km.out$tot.withinss is the total within-cluster sum of squares,
which we seek to minimize by performing K-means clustering (Equation
10.11). The individual within-cluster sum-of-squares are contained in the
vector km.out$withinss.

We strongly recommend always running K-means clustering with a large
value of nstart, such as 20 or 50, since otherwise an undesirable local
optimum may be obtained.

When performing K-means clustering, in addition to using multiple
initial cluster assignments, it is also important to set a random seed using the
set.seed() function. This way, the initial cluster assignments in Step 1 can
be replicated, and the K-means output will be fully reproducible.

10.5.2 Hierarchical Clustering

The hclust() function implements hierarchical clustering in R. In the
following example we use the data from Section 10.5.1 to plot the hierarchical
clustering dendrogram using complete, single, and average linkage
clustering, with Euclidean distance as the dissimilarity measure. We begin by
clustering observations using complete linkage. The dist() function is used
to compute the 50 × 50 inter-observation Euclidean distance matrix.

> hc.complete=hclust(dist(x), method="complete")

We could just as easily perform hierarchical clustering with average or
single linkage instead:

> hc.average=hclust(dist(x), method="average")
> hc.single=hclust(dist(x), method="single")

We can now plot the dendrograms obtained using the usual plot() function.
The numbers at the bottom of the plot identify each observation.

> par(mfrow=c(1,3))
> plot(hc.complete, main="Complete Linkage", xlab="", sub="",
    cex=.9)
> plot(hc.average, main="Average Linkage", xlab="", sub="",
    cex=.9)
> plot(hc.single, main="Single Linkage", xlab="", sub="",
    cex=.9)

To determine the cluster labels for each observation associated with a
given cut of the dendrogram, we can use the cutree() function:

> cutree(hc.complete, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
[30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> cutree(hc.average, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
[30] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
> cutree(hc.single, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

For this data, complete and average linkage generally separate the
observations into their correct groups. However, single linkage identifies one point
as belonging to its own cluster. A more sensible answer is obtained when
four clusters are selected, although there are still two singletons.

> cutree(hc.single, 4)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3
[30] 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3

To scale the variables before performing hierarchical clustering of the
observations, we use the scale() function:

> xsc=scale(x)
> plot(hclust(dist(xsc), method="complete"),
    main="Hierarchical Clustering with Scaled Features")

Correlation-based distance can be computed using the as.dist() function,
which converts an arbitrary square symmetric matrix into a form that
the hclust() function recognizes as a distance matrix. However, this only
makes sense for data with at least three features since the absolute
correlation between any two observations with measurements on two features is
always 1. Hence, we will cluster a three-dimensional data set.

> x=matrix(rnorm(30*3), ncol=3)
> dd=as.dist(1-cor(t(x)))
> plot(hclust(dd, method="complete"),
    main="Complete Linkage with Correlation-Based Distance",
    xlab="", sub="")
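The remark about two features can be confirmed numerically. With only two measurements per observation, any two observation profiles lie on a straight line, so their correlation is either 1 or −1. A brief sketch on simulated data (the names x2 and cors are illustrative):

```r
set.seed(1)
x2 <- matrix(rnorm(10 * 2), ncol = 2)   # ten observations, two features

# Correlations between observation profiles (rows), hence the transpose
cors <- cor(t(x2))

# Every absolute correlation equals 1, up to rounding
range(abs(cors))
```

A correlation-based distance built from such a matrix would therefore take only the values 0 and 2, which carries no useful information for clustering.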


10.6 Lab 3: NCI60 Data Example 

Unsupervised techniques are often used in the analysis of genomic data.
In particular, PCA and hierarchical clustering are popular tools. We
illustrate these techniques on the NCI60 cancer cell line microarray data, which
consists of 6,830 gene expression measurements on 64 cancer cell lines.

> library(ISLR) 

> nci.labs=NCI60$labs 

> nci.data=NCI60$data 

Each cell line is labeled with a cancer type. We do not make use of the 
cancer types in performing PCA and clustering, as these are unsupervised 
techniques. But after performing PCA and clustering, we will check to 
see the extent to which these cancer types agree with the results of these 
unsupervised techniques. 

The data has 64 rows and 6,830 columns. 

> dim(nci.data) 

[1] 64 6830 


We begin by examining the cancer types for the cell lines.

> nci.labs[1:4]
[1] "CNS"   "CNS"   "CNS"   "RENAL"
> table(nci.labs)
nci.labs
     BREAST         CNS       COLON K562A-repro K562B-repro
          7           5           7           1           1
   LEUKEMIA MCF7A-repro MCF7D-repro    MELANOMA       NSCLC
          6           1           1           8           9
    OVARIAN    PROSTATE       RENAL     UNKNOWN
          6           2           9           1


10.6.1 PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to 
have standard deviation one, although one could reasonably argue that it 
is better not to scale the genes. 

> pr.out=prcomp(nci.data, scale=TRUE) 

We now plot the first few principal component score vectors, in order to
visualize the data. The observations (cell lines) corresponding to a given
cancer type will be plotted in the same color, so that we can see to what
extent the observations within a cancer type are similar to each other. We
first create a simple function that assigns a distinct color to each distinct
value in a vector. The function will be used to assign a color to each of
the 64 cell lines, based on the cancer type to which it corresponds.

> Cols=function(vec){
+    cols=rainbow(length(unique(vec)))
+    return(cols[as.numeric(as.factor(vec))])
+  }

Note that the rainbow() function takes as its argument a positive integer,
and returns a vector containing that number of distinct colors. We now can
plot the principal component score vectors.

> par(mfrow=c(1,2))
> plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19,
    xlab="Z1", ylab="Z2")
> plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19,
    xlab="Z1", ylab="Z3")

The resulting plots are shown in Figure 10.15. On the whole, cell lines 
corresponding to a single cancer type do tend to have similar values on the 
first few principal component score vectors. This indicates that cell lines 
from the same cancer type tend to have pretty similar gene expression 
levels. 
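The behavior of the color-assignment helper defined above is easy to check on a short made-up label vector. The vector labs below is hypothetical and stands in for nci.labs so that the sketch runs without the ISLR package.

```r
# Same helper as in the lab: one color per distinct label
Cols <- function(vec) {
  cols <- rainbow(length(unique(vec)))
  cols[as.numeric(as.factor(vec))]
}

# Hypothetical labels with three distinct cancer types
labs <- c("CNS", "CNS", "RENAL", "BREAST", "RENAL")
label_cols <- Cols(labs)
label_cols
```

Observations sharing a label receive the same color, and the number of distinct colors equals the number of distinct labels. The mapping from labels to colors follows the sorted order of the factor levels.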

We can obtain a summary of the proportion of variance explained (PVE)
of the first few principal components using the summary() method for a
prcomp object (we have truncated the printout):

> summary(pr.out)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5
Standard deviation     27.853 21.4814 19.8205 17.0326 15.9718
Proportion of Variance  0.114  0.0676  0.0575  0.0425  0.0374
Cumulative Proportion   0.114  0.1812  0.2387  0.2812  0.3185


FIGURE 10.15. Projections of the NCI60 cancer cell lines onto the first three
principal components (in other words, the scores for the first three principal
components). On the whole, observations belonging to a single cancer type tend to
lie near each other in this low-dimensional space. It would not have been possible
to visualize the data without using a dimension reduction method such as PCA,
since based on the full data set there are $\binom{6,830}{2}$ possible scatterplots, none of
which would have been particularly informative.

Using the plot() function, we can also plot the variance explained by the
first few principal components.

> plot(pr.out)

Note that the height of each bar in the bar plot is given by squaring the
corresponding element of pr.out$sdev. However, it is more informative to
plot the PVE of each principal component (i.e. a scree plot) and the
cumulative PVE of each principal component. This can be done with just a
little work.

> pve=100*pr.out$sdev^2/sum(pr.out$sdev^2)
> par(mfrow=c(1,2))
> plot(pve, type="o", ylab="PVE", xlab="Principal Component",
    col="blue")
> plot(cumsum(pve), type="o", ylab="Cumulative PVE",
    xlab="Principal Component", col="brown3")

(Note that the elements of pve can also be computed directly from the
summary, summary(pr.out)$importance[2,], and the elements of cumsum(pve)
are given by summary(pr.out)$importance[3,]; the summary reports
proportions rather than percentages, so these must be multiplied by 100 to
match.) The resulting plots are shown
in Figure 10.16. We see that together, the first seven principal components
explain around 40% of the variance in the data. This is not a huge amount
of the variance. However, looking at the scree plot, we see that while each
of the first seven principal components explains a substantial amount of
variance, there is a marked decrease in the variance explained by further
principal components. That is, there is an elbow in the plot after
approximately the seventh principal component. This suggests that there may
be little benefit to examining more than seven or so principal components
(though even examining seven principal components may be difficult).

FIGURE 10.16. The PVE of the principal components of the NCI60 cancer cell
line microarray data set. Left: the PVE of each principal component is shown.
Right: the cumulative PVE of the principal components is shown. Together, all
principal components explain 100% of the variance.
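Reading an elbow off a scree plot is informal. A simple complement is to count how many components are needed to reach a chosen cumulative PVE. The sketch below uses the built-in USArrests data as a stand-in so that it runs without the ISLR package; the 80% threshold and the names n_needed and from_summary are illustrative assumptions, not a recommended rule.

```r
pr.out <- prcomp(USArrests, scale = TRUE)

# PVE in percent, computed from the standard deviations
pve <- 100 * pr.out$sdev^2 / sum(pr.out$sdev^2)

# Smallest number of components whose cumulative PVE reaches 80%
n_needed <- which(cumsum(pve) >= 80)[1]

# The same quantities from summary(); these are proportions, rounded,
# so they are multiplied by 100 before comparison
from_summary <- 100 * summary(pr.out)$importance[2, ]

n_needed
round(cbind(pve, from_summary), 2)
```

The same lines apply unchanged to the prcomp object fitted to the NCI60 data, with whatever threshold suits the analysis.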


10.6.2 Clustering the Observations of the NCI60 Data 

We now proceed to hierarchically cluster the cell lines in the NCI60 data, 
with the goal of finding out whether or not the observations cluster into 
distinct types of cancer. To begin, we standardize the variables to have 
mean zero and standard deviation one. As mentioned earlier, this step is 
optional and should be performed only if we want each gene to be on the 
same scale. 

> sd.data=scale(nci.data) 

We now perform hierarchical clustering of the observations using complete,
single, and average linkage. Euclidean distance is used as the dissimilarity
measure.

> par(mfrow=c(1,3))
> data.dist=dist(sd.data)
> plot(hclust(data.dist), labels=nci.labs,
    main="Complete Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="average"), labels=nci.labs,
    main="Average Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="single"), labels=nci.labs,
    main="Single Linkage", xlab="", sub="", ylab="")

The results are shown in Figure 10.17. We see that the choice of linkage
certainly does affect the results obtained. Typically, single linkage will tend
to yield trailing clusters: very large clusters onto which individual
observations attach one-by-one. On the other hand, complete and average linkage
tend to yield more balanced, attractive clusters. For this reason, complete
and average linkage are generally preferred to single linkage. Clearly cell
lines within a single cancer type do tend to cluster together, although the
clustering is not perfect. We will use complete linkage hierarchical
clustering for the analysis that follows.

FIGURE 10.17. The NCI60 cancer cell line microarray data, clustered with
average, complete, and single linkage, and using Euclidean distance as the
dissimilarity measure. Complete and average linkage tend to yield evenly sized clusters
whereas single linkage tends to yield extended clusters to which single leaves are
fused one by one.

We can cut the dendrogram at the height that will yield a particular 
number of clusters, say four: 

> hc.out=hclust(dist(sd.data))
> hc.clusters=cutree(hc.out,4)
> table(hc.clusters,nci.labs)

There are some clear patterns. All the leukemia cell lines fall in cluster 3, 
while the breast cancer cell lines are spread out over three different clusters. 
We can plot the cut on the dendrogram that produces these four clusters: 

> par(mfrow=c(1,1)) 

> plot(hc.out, labels=nci.labs) 

> abline(h=139, col="red") 

The abline() function draws a straight line on top of any existing plot
in R. The argument h=139 plots a horizontal line at height 139 on the
dendrogram; this is the height that results in four distinct clusters. It is
easy to verify that the resulting clusters are the same as the ones we
obtained using cutree(hc.out,4).
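
The equivalence between cutting at a height and asking for a number of clusters can be checked directly. The sketch below is only an illustration: it uses the USArrests data that ships with R as a stand-in for the NCI60 data, and the names hc, by.k, by.h and h.cut are hypothetical.

```r
# USArrests ships with base R; it stands in for the NCI60 data here
hc = hclust(dist(scale(USArrests)))      # complete linkage is the default
n = nrow(USArrests)

# Cut by number of clusters
by.k = cutree(hc, k = 4)

# Cut by height: any height between the merge that leaves four clusters
# and the merge that leaves three clusters gives the same partition
h.cut = mean(hc$height[c(n - 4, n - 3)])
by.h = cutree(hc, h = h.cut)

print(all(by.k == by.h))
print(table(by.k))
```

After plot(hc), a call such as abline(h=h.cut, col="red") would draw the corresponding cut on the dendrogram.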

Printing the output of hclust gives a useful brief summary of the object:

> hc.out

Call:
hclust(d = dist(sd.data))

Cluster method   : complete
Distance         : euclidean
Number of objects: 64


We claimed earlier in Section 10.3.2 that K-means clustering and
hierarchical clustering with the dendrogram cut to obtain the same number
of clusters can yield very different results. How do these NCI60 hierarchical
clustering results compare to what we get if we perform K-means clustering
with K = 4?

> set.seed(2)
> km.out=kmeans(sd.data, 4, nstart=20)
> km.clusters=km.out$cluster
> table(km.clusters,hc.clusters)
           hc.clusters
km.clusters  1  2  3  4
          1 11  0  0  9
          2  0  0  8  0
          3  9  0  0  0
          4 20  7  0  0

We see that the four clusters obtained using hierarchical clustering and
K-means clustering are somewhat different. Cluster 2 in K-means clustering is
identical to cluster 3 in hierarchical clustering. However, the other clusters
differ: for instance, cluster 4 in K-means clustering contains a portion of
the observations assigned to cluster 1 by hierarchical clustering, as well as
all of the observations assigned to cluster 2 by hierarchical clustering.
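
The same kind of cross-tabulation can be tried on any data set. The sketch below is a minimal illustration on the built-in USArrests data rather than the NCI60 data; the overlap summary agree is a rough, hypothetical measure added here for illustration and is not part of the lab.

```r
set.seed(2)
x = scale(USArrests)                       # built-in data, standardized

hc.cl = cutree(hclust(dist(x)), 4)         # complete linkage, four clusters
km.cl = kmeans(x, 4, nstart = 20)$cluster  # K-means with K = 4

tab = table(km.cl, hc.cl)
print(tab)

# Cluster numbers are arbitrary, so compare through the largest overlap
# in each row rather than through matching labels
agree = sum(apply(tab, 1, max)) / length(km.cl)
print(agree)
```

A row of the table with a single nonzero entry means that the K-means cluster sits entirely inside one hierarchical cluster; a row spread over several columns means the two methods split those observations differently.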

Rather than performing hierarchical clustering on the entire data matrix, 
we can simply perform hierarchical clustering on the first few principal 
component score vectors, as follows: 

> hc.out=hclust(dist(pr.out$x[,1:5]))
> plot(hc.out, labels=nci.labs, main="Hier. Clust. on First Five Score Vectors")
> table(cutree(hc.out,4), nci.labs)

Not surprisingly, these results are different from the ones that we obtained
when we performed hierarchical clustering on the full data set. Sometimes
performing clustering on the first few principal component score vectors
can give better results than performing clustering on the full data. In this
situation, we might view the principal component step as one of denoising
the data. We could also perform K-means clustering on the first few
principal component score vectors rather than the full data set.
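
A minimal sketch of this idea on a data set that ships with R is shown here. USArrests stands in for the NCI60 data, and the choices of two score vectors and three clusters are arbitrary assumptions made for illustration.

```r
# PCA on standardized variables (USArrests is a stand-in data set)
pr = prcomp(USArrests, scale. = TRUE)
pve = pr$sdev^2 / sum(pr$sdev^2)          # proportion of variance explained

scores = pr$x[, 1:2]                      # first two score vectors only

hc.pc = hclust(dist(scores))              # clustering on the scores
hc.full = hclust(dist(scale(USArrests)))  # clustering on the full data

# Cross-tabulate the two three-cluster solutions
tab = table(cutree(hc.pc, 3), cutree(hc.full, 3))
print(round(pve, 3))
print(tab)
```

The same score matrix could be passed to kmeans() instead of hclust() to carry out K-means clustering on the leading principal components.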

10.7 Exercises 

Conceptual 

1. This problem involves the K-means clustering algorithm.

(a) Prove (10.12).

(b) On the basis of this identity, argue that the K-means clustering
algorithm (Algorithm 10.1) decreases the objective (10.11) at
each iteration.

2. Suppose that we have four observations, for which we compute a
dissimilarity matrix, given by

\[
\begin{bmatrix}
     & 0.3 & 0.4  & 0.7  \\
0.3  &     & 0.5  & 0.8  \\
0.4  & 0.5 &      & 0.45 \\
0.7  & 0.8 & 0.45 &
\end{bmatrix}
\]

For instance, the dissimilarity between the first and second
observations is 0.3, and the dissimilarity between the second and fourth
observations is 0.8.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram
that results from hierarchically clustering these four observations
using complete linkage. Be sure to indicate on the plot the
height at which each fusion occurs, as well as the observations
corresponding to each leaf in the dendrogram.

(b) Repeat (a), this time using single linkage clustering.

(c) Suppose that we cut the dendrogram obtained in (a) such that
two clusters result. Which observations are in each cluster?

(d) Suppose that we cut the dendrogram obtained in (b) such that
two clusters result. Which observations are in each cluster?

(e) It is mentioned in the chapter that at each fusion in the
dendrogram, the position of the two clusters being fused can be
swapped without changing the meaning of the dendrogram. Draw
a dendrogram that is equivalent to the dendrogram in (a), for
which two or more of the leaves are repositioned, but for which
the meaning of the dendrogram is the same.

3. In this problem, you will perform K-means clustering manually, with
K = 2, on a small example with n = 6 observations and p = 2
features. The observations are as follows.

Obs.  X1  X2
1      1   4
2      1   3
3      0   4
4      5   1
5      6   2
6      4   0

(a) Plot the observations. 

(b) Randomly assign a cluster label to each observation. You can 
use the sample () command in R to do this. Report the cluster 
labels for each observation. 

(c) Compute the centroid for each cluster. 

(d) Assign each observation to the centroid to which it is closest, in 
terms of Euclidean distance. Report the cluster labels for each 
observation. 

(e) Repeat (c) and (d) until the answers obtained stop changing. 

(f) In your plot from (a), color the observations according to the 
cluster labels obtained. 

4. Suppose that for a particular data set, we perform hierarchical
clustering using single linkage and using complete linkage. We obtain two
dendrograms.

(a) At a certain point on the single linkage dendrogram, the clusters
{1,2,3} and {4,5} fuse. On the complete linkage dendrogram,
the clusters {1,2,3} and {4,5} also fuse at a certain point.
Which fusion will occur higher on the tree, or will they fuse at
the same height, or is there not enough information to tell?

(b) At a certain point on the single linkage dendrogram, the clusters
{5} and {6} fuse. On the complete linkage dendrogram, the clusters
{5} and {6} also fuse at a certain point. Which fusion will
occur higher on the tree, or will they fuse at the same height, or
is there not enough information to tell?

5. In words, describe the results that you would expect if you performed
K-means clustering of the eight shoppers in Figure 10.14, on the
basis of their sock and computer purchases, with K = 2. Give three
answers, one for each of the variable scalings displayed. Explain.

6. A researcher collects expression measurements for 1,000 genes in 100
tissue samples. The data can be written as a 1,000 × 100 matrix,
which we call X, in which each row represents a gene and each
column a tissue sample. Each tissue sample was processed on a different
day, and the columns of X are ordered so that the samples that were
processed earliest are on the left, and the samples that were processed
later are on the right. The tissue samples belong to two groups:
control (C) and treatment (T). The C and T samples were processed
in a random order across the days. The researcher wishes to determine
whether each gene’s expression measurements differ between the
treatment and control groups.

As a pre-analysis (before comparing T versus C), the researcher
performs a principal component analysis of the data, and finds that the
first principal component (a vector of length 100) has a strong linear
trend from left to right, and explains 10% of the variation. The
researcher now remembers that each patient sample was run on one of
two machines, A and B, and machine A was used more often in the
earlier times while B was used more often later. The researcher has
a record of which sample was run on which machine.

(a) Explain what it means that the first principal component
“explains 10% of the variation”.

(b) The researcher decides to replace the (i, j)th element of X with

\[ x_{ij} - z_{i1} \phi_{j1} \]

where $z_{i1}$ is the ith score, and $\phi_{j1}$ is the jth loading, for the
first principal component. He will then perform a two-sample t-test
on each gene in this new data set in order to determine whether
its expression differs between the two conditions. Critique this
idea, and suggest a better approach.

(c) Design and run a small simulation experiment to demonstrate
the superiority of your idea.

Applied

7. In the chapter, we mentioned the use of correlation-based distance
and Euclidean distance as dissimilarity measures for hierarchical
clustering. It turns out that these two measures are almost equivalent: if
each observation has been centered to have mean zero and standard
deviation one, and if we let $r_{ij}$ denote the correlation between the ith
and jth observations, then the quantity $1 - r_{ij}$ is proportional to the
squared Euclidean distance between the ith and jth observations.

On the USArrests data, show that this proportionality holds.

Hint: The Euclidean distance can be calculated using the dist()
function, and correlations can be calculated using the cor() function.

8. In Section 10.2.3, a formula for calculating PVE was given in
Equation 10.8. We also saw that the PVE can be obtained using the sdev
output of the prcomp() function.

On the USArrests data, calculate PVE in two ways:

(a) Using the sdev output of the prcomp() function, as was done in
Section 10.2.3.

(b) By applying Equation 10.8 directly. That is, use the prcomp()
function to compute the principal component loadings. Then,
use those loadings in Equation 10.8 to obtain the PVE.

These two approaches should give the same results. 

Hint: You will only obtain the same results in (a) and (b) if the same
data is used in both cases. For instance, if in (a) you performed
prcomp() using centered and scaled variables, then you must center
and scale the variables before applying Equation 10.8 in (b).

9. Consider the USArrests data. We will now perform hierarchical
clustering on the states.

(a) Using hierarchical clustering with complete linkage and
Euclidean distance, cluster the states.

(b) Cut the dendrogram at a height that results in three distinct
clusters. Which states belong to which clusters?

(c) Hierarchically cluster the states using complete linkage and
Euclidean distance, after scaling the variables to have standard
deviation one.

(d) What effect does scaling the variables have on the hierarchical
clustering obtained? In your opinion, should the variables be
scaled before the inter-observation dissimilarities are computed?
Provide a justification for your answer.

10. In this problem, you will generate simulated data, and then perform
PCA and K-means clustering on the data.

(a) Generate a simulated data set with 20 observations in each of
three classes (i.e. 60 observations total), and 50 variables.

Hint: There are a number of functions in R that you can use to
generate data. One example is the rnorm() function; runif() is
another option. Be sure to add a mean shift to the observations
in each class so that there are three distinct classes.

(b) Perform PCA on the 60 observations and plot the first two
principal component score vectors. Use a different color to indicate
the observations in each of the three classes. If the three classes
appear separated in this plot, then continue on to part (c). If
not, then return to part (a) and modify the simulation so that
there is greater separation between the three classes. Do not
continue to part (c) until the three classes show at least some
separation in the first two principal component score vectors.

(c) Perform K-means clustering of the observations with K = 3.
How well do the clusters that you obtained in K-means clustering
compare to the true class labels?

Hint: You can use the table() function in R to compare the true
class labels to the class labels obtained by clustering. Be careful
how you interpret the results: K-means clustering will arbitrarily
number the clusters, so you cannot simply check whether the true
class labels and clustering labels are the same.

(d) Perform K-means clustering with K = 2. Describe your results.

(e) Now perform K-means clustering with K = 4, and describe your
results.

(f) Now perform K-means clustering with K = 3 on the first two
principal component score vectors, rather than on the raw data.
That is, perform K-means clustering on the 60 × 2 matrix of
which the first column is the first principal component score
vector, and the second column is the second principal component
score vector. Comment on the results.

(g) Using the scale() function, perform K-means clustering with
K = 3 on the data after scaling each variable to have standard
deviation one. How do these results compare to those obtained
in (b)? Explain.

11. On the book website, www.StatLearning.com, there is a gene
expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with
measurements on 1,000 genes. The first 20 samples are from healthy
patients, while the second 20 are from a diseased group.

(a) Load in the data using read.csv(). You will need to select
header=F.

(b) Apply hierarchical clustering to the samples using
correlation-based distance, and plot the dendrogram. Do the genes separate
the samples into the two groups? Do your results depend on the
type of linkage used?

(c) Your collaborator wants to know which genes differ the most
across the two groups. Suggest a way to answer this question,
and apply it here.


In God we trust, all others bring data.

-William Edwards Deming (1900-1993)^1


We have been gratified by the popularity of the first edition of The
Elements of Statistical Learning. This, along with the fast pace of research
in the statistical learning field, motivated us to update our book with a
second edition.

We have added four new chapters and updated some of the existing
chapters. Because many readers are familiar with the layout of the first
edition, we have tried to change it as little as possible. Here is a summary
of the main changes:

^1 On the Web, this quote has been widely attributed to both Deming and Robert W.
Hayden; however Professor Hayden told us that he can claim no credit for this quote,
and ironically we could find no “data” confirming that Deming actually said this.

Chapter                                      What’s new

1. Introduction
2. Overview of Supervised Learning
3. Linear Methods for Regression             LAR algorithm and generalizations
                                             of the lasso
4. Linear Methods for Classification         Lasso path for logistic regression
5. Basis Expansions and Regularization       Additional illustrations of RKHS
6. Kernel Smoothing Methods
7. Model Assessment and Selection            Strengths and pitfalls of
                                             cross-validation
8. Model Inference and Averaging
9. Additive Models, Trees, and
   Related Methods
10. Boosting and Additive Trees              New example from ecology; some
                                             material split off to Chapter 16.
11. Neural Networks                          Bayesian neural nets and the NIPS
                                             2003 challenge
12. Support Vector Machines and              Path algorithm for SVM classifier
    Flexible Discriminants
13. Prototype Methods and
    Nearest-Neighbors
14. Unsupervised Learning                    Spectral clustering, kernel PCA,
                                             sparse PCA, non-negative matrix
                                             factorization, archetypal analysis,
                                             nonlinear dimension reduction,
                                             Google page rank algorithm, a
                                             direct approach to ICA
15. Random Forests                           New
16. Ensemble Learning                        New
17. Undirected Graphical Models              New
18. High-Dimensional Problems                New

Some further notes:

• Our first edition was unfriendly to colorblind readers; in particular,
we tended to favor red/green contrasts which are particularly
troublesome. We have changed the color palette in this edition to a large
extent, replacing the above with an orange/blue contrast.

• We have changed the name of Chapter 6 from “Kernel Methods” to
“Kernel Smoothing Methods”, to avoid confusion with the
machine-learning kernel method that is discussed in the context of support
vector machines (Chapter 12) and more generally in Chapters 5 and 14.

• In the first edition, the discussion of error-rate estimation in
Chapter 7 was sloppy, as we did not clearly differentiate the notions of
conditional error rates (conditional on the training set) and
unconditional rates. We have fixed this in the new edition.

• Chapters 15 and 16 follow naturally from Chapter 10, and the
chapters are probably best read in that order.

• In Chapter 17, we have not attempted a comprehensive treatment
of graphical models, and discuss only undirected models and some
new methods for their estimation. Due to a lack of space, we have
specifically omitted coverage of directed graphical models.

• Chapter 18 explores the “p ≫ N” problem, which is learning in
high-dimensional feature spaces. These problems arise in many areas,
including genomic and proteomic studies, and document classification.

We thank the many readers who have found the (too numerous) errors in
the first edition. We apologize for those and have done our best to avoid
errors in this new edition. We thank Mark Segal, Bala Rajaratnam, and Larry
Wasserman for comments on some of the new chapters, and many Stanford
graduate and post-doctoral students who offered comments, in particular
Mohammed AlQuraishi, John Boik, Holger Hoefling, Arian Maleki, Donal
McMahon, Saharon Rosset, Babak Shababa, Daniela Witten, Ji Zhu and
Hui Zou. We thank John Kimmel for his patience in guiding us through this
new edition. RT dedicates this edition to the memory of Anna McPhee.

Trevor Hastie
Robert Tibshirani
Jerome Friedman


Stanford, California
August 2008

We are drowning in information and starving for knowledge.

-Rutherford D. Roger


The field of Statistics is constantly challenged by the problems that science
and industry bring to its door. In the early days, these problems often came
from agricultural and industrial experiments and were relatively small in
scope. With the advent of computers and the information age, statistical
problems have exploded both in size and complexity. Challenges in the
areas of data storage, organization and searching have led to the new field
of “data mining”; statistical and computational problems in biology and
medicine have created “bioinformatics.” Vast amounts of data are being
generated in many fields, and the statistician’s job is to make sense of it
all: to extract important patterns and trends, and understand “what the
data says.” We call this learning from data.

The challenges in learning from data have led to a revolution in the
statistical sciences. Since computation plays such a key role, it is not surprising
that much of this new development has been done by researchers in other
fields such as computer science and engineering.

The learning problems that we consider can be roughly categorized as
either supervised or unsupervised. In supervised learning, the goal is to
predict the value of an outcome measure based on a number of input measures;
in unsupervised learning, there is no outcome measure, and the goal is to
describe the associations and patterns among a set of input measures.

This book is our attempt to bring together many of the important new
ideas in learning, and explain them in a statistical framework. While some
mathematical details are needed, we emphasize the methods and their
conceptual underpinnings rather than their theoretical properties. As a result,
we hope that this book will appeal not just to statisticians but also to
researchers and practitioners in a wide variety of fields.

Just as we have learned a great deal from researchers outside of the field 
of statistics, our statistical viewpoint may help others to better understand 
different aspects of learning: 

There is no true interpretation of anything; interpretation is a 
vehicle in the service of human comprehension. The value of 
interpretation is in enabling others to fruitfully think about an 
idea. 

-Andreas Buja 

We would like to acknowledge the contribution of many people to the
conception and completion of this book. David Andrews, Leo Breiman,
Andreas Buja, John Chambers, Bradley Efron, Geoffrey Hinton, Werner
Stuetzle, and John Tukey have greatly influenced our careers.
Balasubramanian Narasimhan gave us advice and help on many computational
problems, and maintained an excellent computing environment. Shin-Ho
Bang helped in the production of a number of the figures. Lee Wilkinson
gave valuable tips on color production. Ilana Belitskaya, Eva Cantoni, Maya
Gupta, Michael Jordan, Shanti Gopatam, Radford Neal, Jorge Picazo,
Bogdan Popescu, Olivier Renaud, Saharon Rosset, John Storey, Ji Zhu, Mu
Zhu, two reviewers and many students read parts of the manuscript and
offered helpful suggestions. John Kimmel was supportive, patient and
helpful at every phase; MaryAnn Brickner and Frank Ganz headed a superb
production team at Springer. Trevor Hastie would like to thank the
statistics department at the University of Cape Town for their hospitality during
the final stages of this book. We gratefully acknowledge NSF and NIH for
their support of this work. Finally, we would like to thank our families and
our parents for their love and support.

Trevor Hastie 
Robert Tibshirani 
Jerome Friedman 

Stanford, California 
May 2001 

The quiet statisticians have changed our world; not by
discovering new facts or technical developments, but by changing the
ways that we reason, experiment and form our opinions ....

-Ian Hacking

Statistical learning plays a key role in many areas of science, finance and
industry. Here are some examples of learning problems:

• Predict whether a patient, hospitalized due to a heart attack, will
have a second heart attack. The prediction is to be based on
demographic, diet and clinical measurements for that patient.

• Predict the price of a stock 6 months from now, on the basis of
company performance measures and economic data.

• Identify the numbers in a handwritten ZIP code, from a digitized 
image. 

• Estimate the amount of glucose in the blood of a diabetic person, 
from the infrared absorption spectrum of that person’s blood. 

• Identify the risk factors for prostate cancer, based on clinical and 
demographic variables. 

The science of learning plays a key role in the fields of statistics, data 
mining and artificial intelligence, intersecting with areas of engineering and 
other disciplines. 

This book is about learning from data. In a typical scenario, we have
an outcome measurement, usually quantitative (such as a stock price) or
categorical (such as heart attack/no heart attack), that we wish to predict
based on a set of features (such as diet and clinical measurements). We
have a training set of data, in which we observe the outcome and feature
measurements for a set of objects (such as people). Using this data we build
a prediction model, or learner, which will enable us to predict the outcome
for new unseen objects. A good learner is one that accurately predicts such
an outcome.

TABLE 1.1. Average percentage of words or characters in an email message
equal to the indicated word or character. We have chosen the words and characters
showing the largest difference between spam and email.

        george   you  your    hp  free   hpl     !   our    re   edu  remove
spam      0.00  2.26  1.38  0.02  0.52  0.01  0.51  0.51  0.13  0.01    0.28
email     1.27  1.27  0.44  0.90  0.07  0.43  0.11  0.18  0.42  0.29    0.01

The examples above describe what is called the supervised learning
problem. It is called “supervised” because of the presence of the outcome
variable to guide the learning process. In the unsupervised learning problem,
we observe only the features and have no measurements of the outcome.
Our task is rather to describe how the data are organized or clustered. We
devote most of this book to supervised learning; the unsupervised problem
is less developed in the literature, and is the focus of Chapter 14.

Here are some examples of real learning problems that are discussed in 
this book. 


Example 1: Email Spam 

The data for this example consists of information from 4601 email
messages, in a study to try to predict whether the email was junk email, or
“spam.” The objective was to design an automatic spam detector that
could filter out spam before clogging the users’ mailboxes. For all 4601
email messages, the true outcome (email type) email or spam is available,
along with the relative frequencies of 57 of the most commonly occurring
words and punctuation marks in the email message. This is a supervised
learning problem, with the outcome the class variable email/spam. It is also
called a classification problem.

Table 1.1 lists the words and characters showing the largest average 
difference between spam and email. 

Our learning method has to decide which features to use and how: for
example, we might use a rule such as

    if (%george < 0.6) & (%you > 1.5)      then spam
                                           else email.

Another form of a rule might be:

    if (0.2 · %you − 0.3 · %george) > 0    then spam
                                           else email.

FIGURE 1.1. Scatterplot matrix of the prostate cancer data. The first row shows
the response against each of the predictors in turn. Two of the predictors, svi and
gleason, are categorical.

For this problem not all errors are equal; we want to avoid filtering out 
good email, while letting spam get through is not desirable but less serious 
in its consequences. We discuss a number of different methods for tackling 
this learning problem in the book. 
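
The two example rules for the spam problem can be written out directly in R. This is only a sketch under stated assumptions: the data frame msgs and its word percentages are made up for illustration and are not taken from the spam data.

```r
# Hypothetical word percentages for three messages (made-up values)
msgs = data.frame(george = c(0.00, 1.30, 0.10),
                  you    = c(2.50, 1.00, 1.60))

# First rule: thresholds on two individual features
rule1 = ifelse(msgs$george < 0.6 & msgs$you > 1.5, "spam", "email")

# Second rule: threshold on a linear combination of the features
rule2 = ifelse(0.2 * msgs$you - 0.3 * msgs$george > 0, "spam", "email")

print(data.frame(msgs, rule1, rule2))
```

Because filtering out good email is the more serious error, a practical version of either rule would presumably use stricter thresholds before labeling a message as spam.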


Example 2: Prostate Cancer 

The data for this example, displayed in Figure 1.1,^1 come from a study
by Stamey et al. (1989) that examined the correlation between the level of
prostate specific antigen (PSA) and a number of clinical measures, in 97
men who were about to receive a radical prostatectomy.

^1 There was an error in these data in the first edition of this book. Subject 32 had
a value of 6.1 for lweight, which translates to a 449 gm prostate! The correct value is
44.9 gm. We are grateful to Prof. Stephen W. Link for alerting us to this error.

FIGURE 1.2. Examples of handwritten digits from U.S. postal envelopes.

The goal is to predict the log of PSA (lpsa) from a number of
measurements including log cancer volume (lcavol), log prostate weight lweight,
age, log of benign prostatic hyperplasia amount lbph, seminal vesicle
invasion svi, log of capsular penetration lcp, Gleason score gleason, and
percent of Gleason scores 4 or 5 pgg45. Figure 1.1 is a scatterplot matrix
of the variables. Some correlations with lpsa are evident, but a good
predictive model is difficult to construct by eye.

This is a supervised learning problem, known as a regression problem,
because the outcome measurement is quantitative.


Example 3: Handwritten Digit Recognition 

The data from this example come from the handwritten ZIP codes on 
envelopes from U.S. postal mail. Each image is a segment from a five digit 
ZIP code, isolating a single digit. The images are 16 x 16 eight-bit grayscale 
maps, with each pixel ranging in intensity from 0 to 255. Some sample 
images are shown in Figure 1.2. 

The images have been normalized to have approximately the same size
and orientation. The task is to predict, from the 16 × 16 matrix of pixel
intensities, the identity of each image (0,1,..., 9) quickly and accurately. If
it is accurate enough, the resulting algorithm would be used as part of an
automatic sorting procedure for envelopes. This is a classification problem
for which the error rate needs to be kept very low to avoid misdirection of
mail. In order to achieve this low error rate, some objects can be assigned
to a “don’t know” category, and sorted instead by hand.
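
One simple way to realize a “don’t know” category is to refuse to classify whenever the most probable digit is not probable enough. The sketch below assumes a classifier that outputs class probabilities; the matrix probs and the cut-off threshold are made-up values for illustration, not a method taken from the text.

```r
# Made-up class probabilities for three images over the digits 0-9
probs = rbind(c(0.95, rep(0.05 / 9, 9)),
              c(0.40, 0.45, rep(0.15 / 8, 8)),
              c(rep(0.01, 9), 0.91))
colnames(probs) = 0:9

threshold = 0.9                      # hypothetical confidence cut-off

best = apply(probs, 1, which.max)    # column index of the most likely digit
conf = apply(probs, 1, max)          # its estimated probability

# Classify only when confident; otherwise defer to hand sorting
pred = ifelse(conf >= threshold, colnames(probs)[best], "don't know")
print(pred)
```

Raising the threshold lowers the error rate among the envelopes that are sorted automatically, at the cost of sending more of them to be sorted by hand.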

Example 4: DNA Expression Microarrays

DNA stands for deoxyribonucleic acid, and is the basic material that makes 
up human chromosomes. DNA microarrays measure the expression of a 
gene in a cell by measuring the amount of mRNA (messenger ribonucleic 
acid) present for that gene. Microarrays are considered a breakthrough 
technology in biology, facilitating the quantitative study of thousands of 
genes simultaneously from a single sample of cells. 

Here is how a DNA microarray works. The nucleotide sequences for a few
thousand genes are printed on a glass slide. A target sample and a reference
sample are labeled with red and green dyes, and each is hybridized with
the DNA on the slide. Through fluoroscopy, the log (red/green) intensities
of RNA hybridizing at each site are measured. The result is a few thousand
numbers, typically ranging from say −6 to 6, measuring the expression level
of each gene in the target relative to the reference sample. Positive values
indicate higher expression in the target versus the reference, and vice versa
for negative values.
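
The sign convention of the log ratio can be seen in a small calculation. The intensities below are made up, and the use of base 2 is an assumption: the text does not state which base of logarithm is used.

```r
# Made-up fluorescence intensities at three sites on the slide
red   = c(800, 100, 250)    # target sample
green = c(100, 800, 250)    # reference sample

# Base-2 log ratio (the base is an assumption; the text does not specify it)
log.ratio = log2(red / green)
print(log.ratio)
```

The first site is expressed more highly in the target (positive value), the second more highly in the reference (negative value), and the third equally in both (zero).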

A gene expression dataset collects together the expression values from a
series of DNA microarray experiments, with each column representing an
experiment. There are therefore several thousand rows representing
individual genes, and tens of columns representing samples: in the particular
example of Figure 1.3 there are 6830 genes (rows) and 64 samples (columns),
although for clarity only a random sample of 100 rows are shown. The
figure displays the data set as a heat map, ranging from green (negative) to
red (positive). The samples are 64 cancer tumors from different patients.

The challenge here is to understand how the genes and samples are
organized. Typical questions include the following:

(a) which samples are most similar to each other, in terms of their
expression profiles across genes?

(b) which genes are most similar to each other, in terms of their expression
profiles across samples?

(c) do certain genes show very high (or low) expression for certain cancer
samples?
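
Question (a) can be sketched by treating each sample (column) as a point in gene space and clustering those points. The example below uses a small simulated matrix rather than the tumor data; the sizes, the shift of 3 units, and the names expr and groups are all assumptions made for illustration.

```r
set.seed(1)
# Made-up expression matrix: 200 genes (rows) by 12 samples (columns)
expr = matrix(rnorm(200 * 12), nrow = 200, ncol = 12)

# Give samples 7-12 higher expression on the first 50 genes
expr[1:50, 7:12] = expr[1:50, 7:12] + 3

# dist() compares rows, so transpose to measure distances between samples
sample.dist = dist(t(expr))

# Complete-linkage hierarchical clustering, cut into two groups
groups = cutree(hclust(sample.dist), 2)
print(groups)
```

Question (b) is the same computation without the transpose, since dist(expr) measures distances between genes rather than between samples.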

We could view this task as a regression problem, with two categorical
predictor variables—genes and samples—with the response variable being
the level of expression. However, it is probably more useful to view it as
an unsupervised learning problem. For example, for question (a) above, we
think of the samples as points in 6830-dimensional space, which we want
to cluster together in some way.

FIGURE 1.3. DNA microarray data: expression matrix of 6830 genes (rows)
and 64 samples (columns), for the human tumor data. Only a random sample
of 100 rows are shown. The display is a heat map, ranging from bright green
(negative, under expressed) to bright red (positive, over expressed). Missing values
are gray. The rows and columns are displayed in a randomly chosen order.

Who Should Read this Book

This book is designed for researchers and students in a broad variety of 
fields: statistics, artificial intelligence, engineering, finance and others. We 
expect that the reader will have had at least one elementary course in 
statistics, covering basic topics including linear regression. 

We have not attempted to write a comprehensive catalog of learning 
methods, but rather to describe some of the most important techniques. 
Equally notable, we describe the underlying concepts and considerations 
by which a researcher can judge a learning method. We have tried to write 
this book in an intuitive fashion, emphasizing concepts rather than math¬ 
ematical details. 

As statisticians, our exposition will naturally reflect our backgrounds and 
areas of expertise. However, in the past eight years we have been attending 
conferences in neural networks, data mining and machine learning, and our 
thinking has been heavily influenced by these exciting fields. This influence 
is evident in our current research, and in this book. 


How This Book is Organized 

Our view is that one must understand simple methods before trying to 
grasp more complex ones. Hence, after giving an overview of the supervised
learning problem in Chapter 2, we discuss linear methods for regression 
and classification in Chapters 3 and 4. In Chapter 5 we describe splines, 
wavelets and regularization/penalization methods for a single predictor, 
while Chapter 6 covers kernel methods and local regression. Both of these 
sets of methods are important building blocks for high-dimensional learn¬ 
ing techniques. Model assessment and selection is the topic of Chapter 7, 
covering the concepts of bias and variance, overfitting and methods such as 
cross-validation for choosing models. Chapter 8 discusses model inference 
and averaging, including an overview of maximum likelihood, Bayesian inference
and the bootstrap, the EM algorithm, Gibbs sampling and bagging. 
A related procedure called boosting is the focus of Chapter 10. 

In Chapters 9-13 we describe a series of structured methods for supervised
learning, with Chapters 9 and 11 covering regression and Chapters
12 and 13 focusing on classification. Chapter 14 describes methods for 
unsupervised learning. Two recently proposed techniques, random forests 
and ensemble learning, are discussed in Chapters 15 and 16. We describe 
undirected graphical models in Chapter 17 and finally we study high-dimensional
problems in Chapter 18. 

At the end of each chapter we discuss computational considerations im¬ 
portant for data mining applications, including how the computations scale 
with the number of observations and predictors. Each chapter ends with 
Bibliographic Notes giving background references for the material. 

We recommend that Chapters 1-4 be first read in sequence. Chapter 7 
should also be considered mandatory, as it covers central concepts that 
pertain to all learning methods. With this in mind, the rest of the book 
can be read sequentially, or sampled, depending on the reader’s interest. 

The symbol indicates a technically difficult section, one that can 
be skipped without interrupting the flow of the discussion. 


Book Website 

The website for this book is located at 

http://www-stat.Stanford.edu/ElemStatLearn 

It contains a number of resources, including many of the datasets used in 
this book. 

Note for Instructors 

We have successfully used the first edition of this book as the basis for a 
two-quarter course, and with the additional materials in this second edition, 
it could even be used for a three-quarter sequence. Exercises are provided at 
the end of each chapter. It is important for students to have access to good 
software tools for these topics. We used the R and S-PLUS programming 
languages in our courses. 


2.1 Introduction 

The first three examples described in Chapter 1 have several components 
in common. For each there is a set of variables that might be denoted as 
inputs, which are measured or preset. These have some influence on one or 
more outputs. For each example the goal is to use the inputs to predict the 
values of the outputs. This exercise is called supervised learning. 

We have used the more modern language of machine learning. In the 
statistical literature the inputs are often called the predictors, a term we 
will use interchangeably with inputs, and more classically the independent 
variables. In the pattern recognition literature the term features is preferred, 
which we use as well. The outputs are called the responses, or classically 
the dependent variables. 


2.2 Variable Types and Terminology 

The outputs vary in nature among the examples. In the glucose prediction 
example, the output is a quantitative measurement, where some measurements
are bigger than others, and measurements close in value are close 
in nature. In the famous Iris discrimination example due to R. A. Fisher, 
the output is qualitative (species of Iris) and assumes values in a finite set 
$\mathcal{G} = \{\text{Virginica, Setosa and Versicolor}\}$. In the handwritten digit example 
the output is one of 10 different digit classes: $\mathcal{G} = \{0, 1, \ldots, 9\}$. In both of 
these there is no explicit ordering in the classes, and in fact often descriptive
labels rather than numbers are used to denote the classes. Qualitative 
variables are also referred to as categorical or discrete variables as well as 
factors. 

For both types of outputs it makes sense to think of using the inputs to 
predict the output. Given some specific atmospheric measurements today 
and yesterday, we want to predict the ozone level tomorrow. Given the 
grayscale values for the pixels of the digitized image of the handwritten 
digit, we want to predict its class label. 

This distinction in output type has led to a naming convention for the 
prediction tasks: regression when we predict quantitative outputs, and clas¬ 
sification when we predict qualitative outputs. We will see that these two 
tasks have a lot in common, and in particular both can be viewed as a task 
in function approximation. 

Inputs also vary in measurement type; we can have some of each of qual¬ 
itative and quantitative input variables. These have also led to distinctions 
in the types of methods that are used for prediction: some methods are 
defined most naturally for quantitative inputs, some most naturally for 
qualitative and some for both. 

A third variable type is ordered categorical, such as small, medium and 
large, where there is an ordering between the values, but no metric notion 
is appropriate (the difference between medium and small need not be the 
same as that between large and medium). These are discussed further in 
Chapter 4. 

Qualitative variables are typically represented numerically by codes. The 
easiest case is when there are only two classes or categories, such as “success”
or “failure,” “survived” or “died.” These are often represented by a 
single binary digit or bit as 0 or 1, or else by −1 and 1. For reasons that will 
become apparent, such numeric codes are sometimes referred to as targets. 
When there are more than two categories, several alternatives are available. 
The most useful and commonly used coding is via dummy variables. Here a 
$K$-level qualitative variable is represented by a vector of $K$ binary variables 
or bits, only one of which is “on” at a time. Although more compact coding 
schemes are possible, dummy variables are symmetric in the levels of the 
factor. 
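
As an illustration of the dummy-variable coding just described, here is a minimal Python sketch. The helper name `dummy_code` and the example labels are assumptions made for this illustration only; they do not come from the text.

```python
import numpy as np

def dummy_code(labels):
    """Return the K levels and an N x K 0/1 indicator matrix for a qualitative variable."""
    levels = sorted(set(labels))                      # the K distinct levels, in a fixed order
    index = {level: j for j, level in enumerate(levels)}
    coded = np.zeros((len(labels), len(levels)), dtype=int)
    for i, label in enumerate(labels):
        coded[i, index[label]] = 1                    # exactly one bit is "on" per observation
    return levels, coded

# Hypothetical observations of a three-level factor.
levels, coded = dummy_code(["Setosa", "Virginica", "Setosa", "Versicolor"])
```

Each row of `coded` has a single 1, and no level is singled out as a baseline, which is the symmetry mentioned above.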

We will typically denote an input variable by the symbol $X$. If $X$ is 
a vector, its components can be accessed by subscripts $X_j$. Quantitative 
outputs will be denoted by $Y$, and qualitative outputs by $G$ (for group). 
We use uppercase letters such as $X$, $Y$ or $G$ when referring to the generic 
aspects of a variable. Observed values are written in lowercase; hence the 
$i$th observed value of $X$ is written as $x_i$ (where $x_i$ is again a scalar or 
vector). Matrices are represented by bold uppercase letters; for example, a 
set of $N$ input $p$-vectors $x_i$, $i = 1, \ldots, N$ would be represented by the $N \times p$ 
matrix $\mathbf{X}$. In general, vectors will not be bold, except when they have $N$ 
components; this convention distinguishes a $p$-vector of inputs $x_i$ for the 
$i$th observation from the $N$-vector $\mathbf{x}_j$ consisting of all the observations on 
variable $X_j$. Since all vectors are assumed to be column vectors, the $i$th 
row of $\mathbf{X}$ is $x_i^T$, the vector transpose of $x_i$. 

For the moment we can loosely state the learning task as follows: given 
the value of an input vector $X$, make a good prediction of the output $Y$, 
denoted by $\hat{Y}$ (pronounced “y-hat”). If $Y$ takes values in $\mathbb{R}$ then so should 
$\hat{Y}$; likewise for categorical outputs, $\hat{G}$ should take values in the same set $\mathcal{G}$ 
associated with $G$. 

For a two-class $G$, one approach is to denote the binary coded target 
as $Y$, and then treat it as a quantitative output. The predictions $\hat{Y}$ will 
typically lie in $[0, 1]$, and we can assign to $\hat{G}$ the class label according to 
whether $\hat{y} > 0.5$. This approach generalizes to $K$-level qualitative outputs 
as well. 

We need data to construct prediction rules, often a lot of it. We thus 
suppose we have available a set of measurements $(x_i, y_i)$ or $(x_i, g_i)$, 
$i = 1, \ldots, N$, known as the training data, with which to construct our prediction 
rule. 


2.3 Two Simple Approaches to Prediction: Least 
Squares and Nearest Neighbors 

In this section we develop two simple but powerful prediction methods: the 
linear model fit by least squares and the $k$-nearest-neighbor prediction rule. 
The linear model makes huge assumptions about structure and yields stable 
but possibly inaccurate predictions. The method of $k$-nearest neighbors 
makes very mild structural assumptions: its predictions are often accurate 
but can be unstable. 

2.3.1 Linear Models and Least Squares 

The linear model has been a mainstay of statistics for the past 30 years 
and remains one of our most important tools. Given a vector of inputs 
$X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model 

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j. \tag{2.1}$$

The term $\hat{\beta}_0$ is the intercept, also known as the bias in machine learning. 
Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta}_0$ in 
the vector of coefficients $\hat{\beta}$, and then write the linear model in vector form 
as an inner product 

$$\hat{Y} = X^T \hat{\beta}, \tag{2.2}$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). 
Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be 
a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the 
$(p + 1)$-dimensional input-output space, $(X, \hat{Y})$ represents a hyperplane. 
If the constant is included in $X$, then the hyperplane includes the origin 
and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point 
$(0, \hat{\beta}_0)$. From now on we assume that the intercept is included in $\hat{\beta}$. 

Viewed as a function over the $p$-dimensional input space, $f(X) = X^T \beta$ 
is linear, and the gradient $f'(X) = \beta$ is a vector in input space that points 
in the steepest uphill direction. 

How do we fit the linear model to a set of training data? There are 
many different methods, but by far the most popular is the method of 
least squares. In this approach, we pick the coefficients $\beta$ to minimize the 
residual sum of squares 

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2. \tag{2.3}$$

$\mathrm{RSS}(\beta)$ is a quadratic function of the parameters, and hence its minimum 
always exists, but may not be unique. The solution is easiest to characterize 
in matrix notation. We can write 

$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta), \tag{2.4}$$

where $\mathbf{X}$ is an $N \times p$ matrix with each row an input vector, and $\mathbf{y}$ is an 
$N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get 
the normal equations 

$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0. \tag{2.5}$$

If $\mathbf{X}^T \mathbf{X}$ is nonsingular, then the unique solution is given by 

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \tag{2.6}$$

and the fitted value at the $i$th input $x_i$ is $\hat{y}_i = \hat{y}(x_i) = x_i^T \hat{\beta}$. At an arbitrary
input $x_0$ the prediction is $\hat{y}(x_0) = x_0^T \hat{\beta}$. The entire fitted surface is 
characterized by the $p$ parameters $\hat{\beta}$. Intuitively, it seems that we do not 
need a very large data set to fit such a model. 
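
The least squares recipe of (2.3) to (2.6) can be sketched in a few lines of Python with NumPy. This is only an illustration: the simulated data, the seed, and the names `beta_true`, `beta_hat` and `x0` are assumptions of the sketch, and solving the normal equations directly is chosen here because it mirrors (2.5), not because it is the numerically preferred route.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed for the illustration
N, p = 50, 3
# First column is the constant 1, so the intercept is part of beta.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])          # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y (requires X^T X nonsingular).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values at the training inputs and a prediction at an arbitrary input x0.
y_hat = X @ beta_hat
x0 = np.array([1.0, 0.3, -1.2])
y0_hat = x0 @ beta_hat

# At the solution the residual vector is orthogonal to every column of X.
normal_eq_residual = X.T @ (y - X @ beta_hat)
```

In practice a routine such as `np.linalg.lstsq` is usually preferred, since it also copes with a singular $\mathbf{X}^T\mathbf{X}$, the case in which the minimizer is not unique.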

Let’s look at an example of the linear model in a classification context. 
Figure 2.1 shows a scatterplot of training data on a pair of inputs $X_1$ and 
$X_2$. The data are simulated, and for the present the simulation model is 
not important. The output class variable $G$ has the values BLUE or ORANGE, 
and is represented as such in the scatterplot. There are 100 points in each 
of the two classes. The linear regression model was fit to these data, with 
the response $Y$ coded as 0 for BLUE and 1 for ORANGE. The fitted values $\hat{Y}$ 
are converted to a fitted class variable $\hat{G}$ according to the rule 

$$\hat{G} = \begin{cases} \text{ORANGE} & \text{if } \hat{Y} > 0.5, \\ \text{BLUE} & \text{if } \hat{Y} \le 0.5. \end{cases} \tag{2.7}$$

Linear Regression of 0/1 Response 

FIGURE 2.1. A classification example in two dimensions. The classes are coded 
as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. 
The line is the decision boundary defined by $x^T \hat{\beta} = 0.5$. The orange shaded region 
denotes that part of input space classified as ORANGE, while the blue region is 
classified as BLUE. 

The set of points in $\mathbb{R}^2$ classified as ORANGE corresponds to $\{x : x^T \hat{\beta} > 0.5\}$, 
indicated in Figure 2.1, and the two predicted classes are separated by the 
decision boundary $\{x : x^T \hat{\beta} = 0.5\}$, which is linear in this case. We see 
that for these data there are several misclassifications on both sides of the 
decision boundary. Perhaps our linear model is too rigid, or are such errors 
unavoidable? Remember that these are errors on the training data itself, 
and we have not said where the constructed data came from. Consider the 
two possible scenarios: 

Scenario 1: The training data in each class were generated from bivariate 
Gaussian distributions with uncorrelated components and different 
means. 

Scenario 2: The training data in each class came from a mixture of 10 low- 
variance Gaussian distributions, with individual means themselves 
distributed as Gaussian. 

A mixture of Gaussians is best described in terms of the generative 
model. One first generates a discrete variable that determines which of 
the component Gaussians to use, and then generates an observation from 
the chosen density. In the case of one Gaussian per class, we will see in 
Chapter 4 that a linear decision boundary is the best one can do, and that 
our estimate is almost optimal. The region of overlap is inevitable, and 
future data to be predicted will be plagued by this overlap as well. 

In the case of mixtures of tightly clustered Gaussians the story is dif¬ 
ferent. A linear decision boundary is unlikely to be optimal, and in fact is 
not. The optimal decision boundary is nonlinear and disjoint, and as such 
will be much more difficult to obtain. 

We now look at another classification and regression procedure that is 
in some sense at the opposite end of the spectrum to the linear model, and 
far better suited to the second scenario. 

2.3.2 Nearest-Neighbor Methods 

Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest
in input space to $x$ to form $\hat{Y}$. Specifically, the $k$-nearest neighbor fit 
for $\hat{Y}$ is defined as follows: 

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \tag{2.8}$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in 
the training sample. Closeness implies a metric, which for the moment we 
assume is Euclidean distance. So, in words, we find the $k$ observations with 
$x_i$ closest to $x$ in input space, and average their responses. 
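
The nearest-neighbor average in (2.8) translates almost word for word into code. The following Python sketch assumes Euclidean distance, as in the text; the function name `knn_fit` and the four toy training points are hypothetical.

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """Average the responses of the k training points closest to x (Euclidean)."""
    dist = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training input
    nearest = np.argsort(dist)[:k]               # indices of the k closest points
    return y_train[nearest].mean()

# Toy training set: three points near the origin and one far away.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0.0, 1.0, 1.0, 0.0])

x = np.array([0.2, 0.1])
y_hat_3 = knn_fit(x, X_train, y_train, k=3)      # mean of the responses 0, 1, 1
y_hat_1 = knn_fit(x, X_train, y_train, k=1)      # response of the single closest point
```

With a 0/1 coded response, thresholding the returned average at 0.5 gives a majority vote among the $k$ neighbors.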

In Figure 2.2 we use the same training data as in Figure 2.1, and use 
15-nearest-neighbor averaging of the binary coded response as the method 
of fitting. Thus $\hat{Y}$ is the proportion of ORANGE's in the neighborhood, and 
so assigning class ORANGE to $\hat{G}$ if $\hat{Y} > 0.5$ amounts to a majority vote in 
the neighborhood. The colored regions indicate all those points in input 
space classified as BLUE or ORANGE by such a rule, in this case found by 
evaluating the procedure on a fine grid in input space. We see that the 
decision boundaries that separate the BLUE from the ORANGE regions are far 
more irregular, and respond to local clusters where one class dominates. 

Figure 2.3 shows the results for 1-nearest-neighbor classification: $\hat{Y}$ is 
assigned the value $y_\ell$ of the closest point $x_\ell$ to $x$ in the training data. In 
this case the regions of classification can be computed relatively easily, and 
correspond to a Voronoi tessellation of the training data. Each point $x_i$ 
has an associated tile bounding the region for which it is the closest input 
point. For all points $x$ in the tile, $\hat{G}(x) = g_i$. The decision boundary is even 
more irregular than before. 

The method of $k$-nearest-neighbor averaging is defined in exactly the 
same way for regression of a quantitative output $Y$, although $k = 1$ would 
be an unlikely choice. 

15-Nearest Neighbor Classifier 

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1.
The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and 
then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence 
chosen by majority vote amongst the 15-nearest neighbors. 

In Figure 2.2 we see that far fewer training observations are misclassified 
than in Figure 2.1. This should not give us too much comfort, though, since 
in Figure 2.3 none of the training data are misclassified. A little thought 
suggests that for $k$-nearest-neighbor fits, the error on the training data 
should be approximately an increasing function of $k$, and will always be 0 
for $k = 1$. An independent test set would give us a more satisfactory means 
for comparing the different methods. 

It appears that $k$-nearest-neighbor fits have a single parameter, the number
of neighbors $k$, compared to the $p$ parameters in least-squares fits. Although
this is the case, we will see that the effective number of parameters 
of $k$-nearest neighbors is $N/k$ and is generally bigger than $p$, and decreases 
with increasing $k$. To get an idea of why, note that if the neighborhoods 
were nonoverlapping, there would be N/k neighborhoods and we would fit 
one parameter (a mean) in each neighborhood. 

It is also clear that we cannot use sum-of-squared errors on the training 
set as a criterion for picking $k$, since we would always pick $k = 1$! It would 
seem that $k$-nearest-neighbor methods would be more appropriate for the 
mixture Scenario 2 described above, while for Gaussian data the decision 
boundaries of $k$-nearest neighbors would be unnecessarily noisy. 

1-Nearest Neighbor Classifier 

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1.
The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and 
then predicted by 1-nearest-neighbor classification. 

2.3.3 From Least Squares to Nearest Neighbors 

The linear decision boundary from least squares is very smooth, and ap¬ 
parently stable to fit. It does appear to rely heavily on the assumption 
that a linear decision boundary is appropriate. In language we will develop 
shortly, it has low variance and potentially high bias. 

On the other hand, the $k$-nearest-neighbor procedures do not appear to 
rely on any stringent assumptions about the underlying data, and can adapt 
to any situation. However, any particular subregion of the decision bound¬ 
ary depends on a handful of input points and their particular positions, 
and is thus wiggly and unstable—high variance and low bias. 

Each method has its own situations for which it works best; in particular 
linear regression is more appropriate for Scenario 1 above, while nearest 
neighbors are more suitable for Scenario 2. The time has come to expose 
the oracle! The data in fact were simulated from a model somewhere between
the two, but closer to Scenario 2. First we generated 10 means $m_k$ 
from a bivariate Gaussian distribution $N((1, 0)^T, \mathbf{I})$ and labeled this class 
BLUE. Similarly, 10 more were drawn from $N((0, 1)^T, \mathbf{I})$ and labeled class 
ORANGE. Then for each class we generated 100 observations as follows: for 
each observation, we picked an $m_k$ at random with probability 1/10, and 
then generated a $N(m_k, \mathbf{I}/5)$, thus leading to a mixture of Gaussian clusters
for each class. Figure 2.4 shows the results of classifying 10,000 new 
observations generated from the model. We compare the results for least 
squares and those for $k$-nearest neighbors for a range of values of $k$. 

FIGURE 2.4. Misclassification curves for the simulation example used in Figures
2.1, 2.2 and 2.3. A single training sample of size 200 was used, and a test 
sample of size 10,000. The orange curves are test and the blue are training error
for $k$-nearest-neighbor classification. The results for linear regression are the 
bigger orange and blue squares at three degrees of freedom. The purple line is the 
optimal Bayes error rate. The upper axis shows $k$, the number of nearest neighbors, 
with ticks at 151, 101, 69, 45, 31, 21, 11, 7, 5, 3 and 1. 
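
The simulation just described can be reproduced in outline as follows. This Python sketch follows the stated recipe (10 means per class, component variance 1/5, 100 training points per class, 10,000 test points), but the seed, the function names, and the particular values of $k$ are assumptions of the sketch, so the error rates it produces will not match the published figure exactly.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, an assumption of this sketch

# Ten component means per class, drawn once and then held fixed.
means_blue = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=10)
means_orange = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=10)

def sample_class(means, n):
    """Pick one of the means with probability 1/10 each, then add N(0, I/5) noise."""
    picks = rng.integers(0, len(means), size=n)
    return means[picks] + rng.normal(scale=np.sqrt(1.0 / 5.0), size=(n, 2))

def draw(n_per_class):
    """Return inputs and 0/1 coded responses (BLUE = 0, ORANGE = 1)."""
    X = np.vstack([sample_class(means_blue, n_per_class),
                   sample_class(means_orange, n_per_class)])
    y = np.repeat([0.0, 1.0], n_per_class)
    return X, y

def linear_fit_predict(X_fit, y_fit, X_new):
    """Least squares on the 0/1 response, with an intercept column."""
    design = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta_hat = np.linalg.lstsq(design, y_fit, rcond=None)[0]
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta_hat

def knn_predict(X_fit, y_fit, X_new, k):
    """Average the responses of the k nearest training points (Euclidean)."""
    sq_dist = ((X_new[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argpartition(sq_dist, k - 1, axis=1)[:, :k]
    return y_fit[nearest].mean(axis=1)

def error_rate(y_hat, y):
    """Misclassification rate after thresholding the fit at 0.5."""
    return float(np.mean((y_hat > 0.5) != (y > 0.5)))

X_train, y_train = draw(100)     # 200 training observations
X_test, y_test = draw(5000)      # 10,000 test observations

ks = (1, 15, 151)
train_err = {k: error_rate(knn_predict(X_train, y_train, X_train, k), y_train) for k in ks}
test_err = {k: error_rate(knn_predict(X_train, y_train, X_test, k), y_test) for k in ks}
ls_train_err = error_rate(linear_fit_predict(X_train, y_train, X_train), y_train)
ls_test_err = error_rate(linear_fit_predict(X_train, y_train, X_test), y_test)
```

Comparing `train_err` with `test_err` across `ks` shows why training error cannot be used to choose $k$: each training point is its own nearest neighbor, so the training error at $k = 1$ is zero regardless of how well the rule generalizes.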

A large subset of the most popular techniques in use today are variants of 
these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, 
captures a large percentage of the market for low-dimensional problems. 
The following list describes some ways in which these simple procedures 
have been enhanced: 

• Kernel methods use weights that decrease smoothly to zero with distance
from the target point, rather than the effective 0/1 weights used 
by $k$-nearest neighbors. 

• In high-dimensional spaces the distance kernels are modified to emphasize
some variable more than others. 

• Local regression fits linear models by locally weighted least squares, 
rather than fitting constants locally. 

• Linear models fit to a basis expansion of the original inputs allow 
arbitrarily complex models. 

• Projection pursuit and neural network models consist of sums of nonlinearly
transformed linear models. 


2.4 Statistical Decision Theory 

In this section we develop a small amount of theory that provides a frame¬ 
work for developing models such as those discussed informally so far. We 
first consider the case of a quantitative output, and place ourselves in the 
world of random variables and probability spaces. Let $X \in \mathbb{R}^p$ denote a 
real valued random input vector, and $Y \in \mathbb{R}$ a real valued random output
variable, with joint distribution $\Pr(X, Y)$. We seek a function $f(X)$ 
for predicting $Y$ given values of the input $X$. This theory requires a loss 
function $L(Y, f(X))$ for penalizing errors in prediction, and by far the most 
common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$. 
This leads us to a criterion for choosing $f$, 

$$\begin{aligned}
\mathrm{EPE}(f) &= \mathrm{E}(Y - f(X))^2 && (2.9) \\
&= \int [y - f(x)]^2 \Pr(dx, dy), && (2.10)
\end{aligned}$$

the expected (squared) prediction error. By conditioning¹ on $X$, we can 
write EPE as 

$$\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\left([Y - f(X)]^2 \mid X\right) \tag{2.11}$$

and we see that it suffices to minimize EPE pointwise: 

$$f(x) = \mathrm{argmin}_c \, \mathrm{E}_{Y|X}\left([Y - c]^2 \mid X = x\right). \tag{2.12}$$

The solution is 

$$f(x) = \mathrm{E}(Y \mid X = x), \tag{2.13}$$

the conditional expectation, also known as the regression function. Thus 
the best prediction of $Y$ at any point $X = x$ is the conditional mean, when 
best is measured by average squared error. 
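
The step from the pointwise criterion (2.12) to the solution (2.13) can be verified with a short calculation. In this sketch $\mu(x)$ is shorthand introduced here for $\mathrm{E}(Y \mid X = x)$; it is not notation from the text.

$$\begin{aligned}
\mathrm{E}\left([Y - c]^2 \mid X = x\right)
&= \mathrm{E}\left([Y - \mu(x)]^2 \mid X = x\right)
 + 2\,(\mu(x) - c)\,\mathrm{E}\left(Y - \mu(x) \mid X = x\right)
 + (\mu(x) - c)^2 \\
&= \mathrm{Var}(Y \mid X = x) + (\mu(x) - c)^2 .
\end{aligned}$$

The cross term vanishes because $\mathrm{E}(Y - \mu(x) \mid X = x) = 0$, and the variance term does not involve $c$, so the expression is minimized by taking $c = \mu(x)$.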

¹ Conditioning here amounts to factoring the joint density $\Pr(X, Y) = \Pr(Y \mid X)\Pr(X)$, 
where $\Pr(Y \mid X) = \Pr(Y, X)/\Pr(X)$, and splitting up the bivariate integral accordingly. 

The nearest-neighbor methods attempt to directly implement this recipe 
using the training data. At each point $x$, we might ask for the average of all 
those $y_i$s with input $x_i = x$. Since there is typically at most one observation 
at any point $x$, we settle for 

$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)), \tag{2.14}$$

where “Ave” denotes average, and $N_k(x)$ is the neighborhood containing 
the $k$ points in $\mathcal{T}$ closest to $x$. Two approximations are happening here: 

• expectation is approximated by averaging over sample data; 

• conditioning at a point is relaxed to conditioning on some region 
“close” to the target point. 

For large training sample size $N$, the points in the neighborhood are likely 
to be close to $x$, and as $k$ gets large the average will get more stable. 
In fact, under mild regularity conditions on the joint probability distribution
$\Pr(X, Y)$, one can show that as $N, k \to \infty$ such that $k/N \to 0$, 
$\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$. In light of this, why look further, since it seems 
we have a universal approximator? We often do not have very large samples.
If the linear or some more structured model is appropriate, then we 
can usually get a more stable estimate than $k$-nearest neighbors, although 
such knowledge has to be learned from the data as well. There are other 
problems though, sometimes disastrous. In Section 2.5 we see that as the 
dimension $p$ gets large, so does the metric size of the $k$-nearest neighborhood.
So settling for nearest neighborhood as a surrogate for conditioning 
will fail us miserably. The convergence above still holds, but the rate of 
convergence decreases as the dimension increases. 

How does linear regression fit into this framework? The simplest explanation
is that one assumes that the regression function $f(x)$ is approximately 
linear in its arguments: 

$$f(x) \approx x^T \beta. \tag{2.15}$$

This is a model-based approach: we specify a model for the regression function.
Plugging this linear model for $f(x)$ into EPE (2.9) and differentiating 
we can solve for $\beta$ theoretically: 

$$\beta = [\mathrm{E}(X X^T)]^{-1} \mathrm{E}(XY). \tag{2.16}$$

Note we have not conditioned on $X$; rather we have used our knowledge 
of the functional relationship to pool over values of $X$. The least squares 
solution (2.6) amounts to replacing the expectation in (2.16) by averages 
over the training data. 
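
To make that last statement concrete, here is a brief worked step, under the assumption that the $N$ training pairs are treated as an empirical sample from $\Pr(X, Y)$. Replacing each expectation by a training average gives

$$\mathrm{E}(X X^T) \approx \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T = \frac{1}{N} \mathbf{X}^T \mathbf{X},
\qquad
\mathrm{E}(XY) \approx \frac{1}{N} \sum_{i=1}^{N} x_i y_i = \frac{1}{N} \mathbf{X}^T \mathbf{y},$$

so that the population solution becomes

$$\left[\tfrac{1}{N} \mathbf{X}^T \mathbf{X}\right]^{-1} \tfrac{1}{N} \mathbf{X}^T \mathbf{y} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$

which is the least squares estimate (2.6); the factors of $1/N$ cancel.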

So both $k$-nearest neighbors and least squares end up approximating 
conditional expectations by averages. But they differ dramatically in terms 
of model assumptions: 

• Least squares assumes $f(x)$ is well approximated by a globally linear 
function. 

• $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally 
constant function. 

Although the latter seems more palatable, we have already seen that we 
may pay a price for this flexibility. 

Many of the more modern techniques described in this book are model 
based, although far more flexible than the rigid linear model. For example, 
additive models assume that 

$$f(X) = \sum_{j=1}^{p} f_j(X_j). \tag{2.17}$$

This retains the additivity of the linear model, but each coordinate function 
$f_j$ is arbitrary. It turns out that the optimal estimate for the additive model 
uses techniques such as $k$-nearest neighbors to approximate univariate conditional
expectations simultaneously for each of the coordinate functions. 
Thus the problems of estimating a conditional expectation in high dimensions
are swept away in this case by imposing some (often unrealistic) model 
assumptions, in this case additivity. 

Are we happy with the criterion (2.11)? What happens if we replace the 
$L_2$ loss function with the $L_1$: $\mathrm{E}|Y - f(X)|$? The solution in this case is the 
conditional median, 

$$\hat{f}(x) = \mathrm{median}(Y \mid X = x), \tag{2.18}$$

which is a different measure of location, and its estimates are more robust 
than those for the conditional mean. $L_1$ criteria have discontinuities in 
their derivatives, which have hindered their widespread use. Other more 
resistant loss functions will be mentioned in later chapters, but squared 
error is analytically convenient and the most popular. 
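
A small numerical check illustrates the difference between the two criteria. The sketch below is an assumption-laden toy: it uses five made-up response values (one a gross outlier) in place of a conditional distribution, and a grid search over constants $c$ in place of calculus.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # made-up responses, one gross outlier
grid = np.linspace(0.0, 100.0, 10001)         # candidate constants c, spaced 0.01 apart

# Average loss of predicting the constant c, for each c on the grid.
l2_risk = ((y[:, None] - grid[None, :]) ** 2).mean(axis=0)
l1_risk = np.abs(y[:, None] - grid[None, :]).mean(axis=0)

c_l2 = grid[l2_risk.argmin()]                 # minimizer of average squared error
c_l1 = grid[l1_risk.argmin()]                 # minimizer of average absolute error
```

Here `c_l2` lands on the sample mean, which the outlier drags far from the bulk of the data, while `c_l1` lands on the sample median and is unaffected by how extreme the outlier is.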

What do we do when the output is a categorical variable G? The same 
paradigm works here, except we need a different loss function for penalizing 
prediction errors. An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible 
classes. Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where 
$K = \mathrm{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, 
where $L(k, \ell)$ is the price paid for classifying an observation belonging to 
class $\mathcal{G}_k$ as $\mathcal{G}_\ell$. Most often we use the zero-one loss function, where all 
misclassifications are charged a single unit. The expected prediction error 
is 

$$\mathrm{EPE} = \mathrm{E}[L(G, \hat{G}(X))], \tag{2.19}$$

where again the expectation is taken with respect to the joint distribution 
$\Pr(G, X)$. Again we condition, and can write EPE as 

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L[\mathcal{G}_k, \hat{G}(X)] \Pr(\mathcal{G}_k \mid X) \tag{2.20}$$

and again it suffices to minimize EPE pointwise: 

$$\hat{G}(x) = \mathrm{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g) \Pr(\mathcal{G}_k \mid X = x). \tag{2.21}$$

With the 0-1 loss function this simplifies to 

$$\hat{G}(x) = \mathrm{argmin}_{g \in \mathcal{G}} \, [1 - \Pr(g \mid X = x)] \tag{2.22}$$

or simply 

$$\hat{G}(x) = \mathcal{G}_k \text{ if } \Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x). \tag{2.23}$$

This reasonable solution is known as the Bayes classifier, and says that 
we classify to the most probable class, using the conditional (discrete) distribution
$\Pr(G \mid X)$. Figure 2.5 shows the Bayes-optimal decision boundary 
for our simulation example. The error rate of the Bayes classifier is called 
the Bayes rate. 

FIGURE 2.5. The optimal Bayes decision boundary for the simulation example 
of Figures 2.1, 2.2 and 2.3. Since the generating density is known for each class, 
this boundary can be calculated exactly (Exercise 2.2). 

Again we see that the $k$-nearest neighbor classifier directly approximates 
this solution: a majority vote in a nearest neighborhood amounts to exactly
this, except that conditional probability at a point is relaxed to conditional
probability within a neighborhood of a point, and probabilities are 
estimated by training-sample proportions. 
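
When the class densities are known, the Bayes classifier can be written down directly. The following Python sketch uses a deliberately simple setup that is an assumption of this illustration and not the simulation model of the text: one input, two classes with unit-variance Gaussian densities centered at −1 and +1, and equal priors. All names (`MEANS`, `PRIOR`, `posterior`, `bayes_classify`) are hypothetical.

```python
import math

MEANS = {"BLUE": -1.0, "ORANGE": 1.0}   # hypothetical class means, unit variance
PRIOR = {"BLUE": 0.5, "ORANGE": 0.5}    # hypothetical equal class priors

def density(x, mean):
    """Unit-variance Gaussian density at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def posterior(x):
    """Pr(G = g | X = x) for each class g, by Bayes' theorem."""
    joint = {g: PRIOR[g] * density(x, MEANS[g]) for g in MEANS}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}

def bayes_classify(x):
    """Classify to the most probable class given X = x."""
    post = posterior(x)
    return max(post, key=post.get)

# In this symmetric setup the decision boundary is x = 0, so the Bayes rate is
# Pr(X > 0 | BLUE) = Pr(Z > 1) for a standard normal Z.
bayes_rate = 0.5 * (1.0 - math.erf(1.0 / math.sqrt(2.0)))
```

A nearest-neighbor classifier replaces `posterior` by class proportions among the $k$ closest training points, which is the approximation described above.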

Suppose for a two-class problem we had taken the dummy-variable approach
and coded $G$ via a binary $Y$, followed by squared error loss estimation.
Then $\hat{f}(X) = \mathrm{E}(Y \mid X) = \Pr(G = \mathcal{G}_1 \mid X)$ if $\mathcal{G}_1$ corresponded to $Y = 1$. 
Likewise for a $K$-class problem, $\mathrm{E}(Y_k \mid X) = \Pr(G = \mathcal{G}_k \mid X)$. This shows 
that our dummy-variable regression procedure, followed by classification to 
the largest fitted value, is another way of representing the Bayes classifier. 
Although this theory is exact, in practice problems can occur, depending 
on the regression model used. For example, when linear regression is used, 
$\hat{f}(X)$ need not be positive, and we might be suspicious about using it as 
an estimate of a probability. We will discuss a variety of approaches to 
modeling Pr(G|X) in Chapter 4. 


2.5 Local Methods in High Dimensions 

We have examined two learning techniques for prediction so far: the stable 
but biased linear model and the less stable but apparently less biased class 
of $k$-nearest-neighbor estimates. It would seem that with a reasonably large 
set of training data, we could always approximate the theoretically optimal 
conditional expectation by $k$-nearest-neighbor averaging, since we should 
be able to find a fairly large neighborhood of observations close to any $x$ 
and average them. This approach and our intuition break down in high 
dimensions, and the phenomenon is commonly referred to as the curse 
of dimensionality (Bellman, 1961). There are many manifestations of this 
problem, and we will examine a few here. 

Consider the nearest-neighbor procedure for inputs uniformly distributed 
in a p -dimensional unit hypercube, as in Figure 2.6. Suppose we send out a 
hypercubical neighborhood about a target point to capture a fraction r of 
the observations. Since this corresponds to a fraction r of the unit volume, 
the expected edge length will be $e_p(r) = r^{1/p}$. In ten dimensions $e_{10}(0.01) = 0.63$ 
and $e_{10}(0.1) = 0.80$, while the entire range for each input is only 1.0. 
So to capture 1% or 10% of the data to form a local average, we must cover 
63% or 80% of the range of each input variable. Such neighborhoods are no 
longer “local.” Reducing r dramatically does not help much either, since 
the fewer observations we average, the higher is the variance of our fit. 
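
The edge-length formula $e_p(r) = r^{1/p}$ is easy to check numerically. The following is a minimal sketch; the function name `edge_length` is an illustrative choice, not something taken from the text.

```python
def edge_length(r: float, p: int) -> float:
    """Expected edge length e_p(r) = r**(1/p) of a subcube that holds a
    fraction r of the volume of the p-dimensional unit cube."""
    return r ** (1.0 / p)


# Edge length needed to capture 1% and 10% of the data as p grows.
for p in (1, 2, 3, 10):
    print(p, round(edge_length(0.01, p), 3), round(edge_length(0.10, p), 3))
```

For $p = 10$ the two values come out at roughly 0.63 and 0.79, which the text rounds to 63% and 80% of the range of each input.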

FIGURE 2.6. The curse of dimensionality is well illustrated by a subcubical
neighborhood for uniform data in a unit cube. The figure on the right shows the
side-length of the subcube needed to capture a fraction $r$ of the volume of the data,
for different dimensions $p$. In ten dimensions we need to cover 80% of the range
of each coordinate to capture 10% of the data.

Another consequence of the sparse sampling in high dimensions is that
all sample points are close to an edge of the sample. Consider $N$ data points
uniformly distributed in a $p$-dimensional unit ball centered at the origin.
Suppose we consider a nearest-neighbor estimate at the origin. The median
distance from the origin to the closest data point is given by the expression

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} \qquad (2.24)$$


(Exercise 2.3). A more complicated expression exists for the mean distance 
to the closest point. For $N = 500$, $p = 10$, $d(p, N) \approx 0.52$, more than
halfway to the boundary. Hence most data points are closer to the boundary 
of the sample space than to any other data point. The reason that this 
presents a problem is that prediction is much more difficult near the edges 
of the training sample. One must extrapolate from neighboring sample 
points rather than interpolate between them. 
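
The value quoted for $N = 500$, $p = 10$ can be reproduced from the closed form, and also checked by simulation. The sketch below is illustrative only: the function names are hypothetical, and the simulation relies on the fact that the radius $R$ of a point uniform in the unit ball satisfies $\Pr(R \le r) = r^p$, so $R$ can be drawn as $U^{1/p}$ with $U$ uniform on $(0, 1)$.

```python
import random
import statistics


def median_nn_distance(p: int, n: int) -> float:
    """Closed form (2.24): d(p, N) = (1 - (1/2)**(1/N))**(1/p)."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)


def simulated_median(p: int, n: int, reps: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the median distance from the origin to the
    closest of n points uniform in the p-dimensional unit ball.

    Each radius is U**(1/p), so the smallest radius is (min U)**(1/p).
    """
    rng = random.Random(seed)
    closest = [
        min(rng.random() for _ in range(n)) ** (1.0 / p) for _ in range(reps)
    ]
    return statistics.median(closest)


print(round(median_nn_distance(10, 500), 3))
print(round(simulated_median(10, 500), 3))
```

Both numbers should land close to 0.52, in line with the statement that the nearest point is more than halfway to the boundary.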

Another manifestation of the curse is that the sampling density is proportional
to $N^{1/p}$, where $p$ is the dimension of the input space and $N$ is the
sample size. Thus, if $N_1 = 100$ represents a dense sample for a single input
problem, then $N_{10} = 100^{10}$ is the sample size required for the same sampling
density with 10 inputs. Thus in high dimensions all feasible training
samples sparsely populate the input space.

Let us construct another uniform example. Suppose we have 1000 training
examples $x_i$ generated uniformly on $[-1, 1]^p$. Assume that the true
relationship between $X$ and $Y$ is

$$Y = f(X) = e^{-8\|X\|^2},$$

without any measurement error. We use the 1-nearest-neighbor rule to
predict $y_0$ at the test-point $x_0 = 0$. Denote the training set by $\mathcal{T}$. We can
compute the expected prediction error at $x_0$ for our procedure, averaging
over all such samples of size 1000. Since the problem is deterministic, this
is the mean squared error (MSE) for estimating $f(0)$:


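$$\begin{aligned}
\mathrm{MSE}(x_0) &= E_\mathcal{T}[f(x_0) - \hat{y}_0]^2 \\
&= E_\mathcal{T}[\hat{y}_0 - E_\mathcal{T}(\hat{y}_0)]^2 + [E_\mathcal{T}(\hat{y}_0) - f(x_0)]^2 \\
&= \mathrm{Var}_\mathcal{T}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0). \qquad (2.25)
\end{aligned}$$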


Figure 2.7 illustrates the setup. We have broken down the MSE into two 
components that will become familiar as we proceed: variance and squared 
bias. Such a decomposition is always possible and often useful, and is known 
as the bias-variance decomposition. Unless the nearest neighbor is at 0, 
$\hat{y}_0$ will be smaller than $f(0)$ in this example, and so the average estimate
will be biased downward. The variance is due to the sampling variance of 
the 1-nearest neighbor. In low dimensions and with N = 1000, the nearest 
neighbor is very close to 0, and so both the bias and variance are small. As 
the dimension increases, the nearest neighbor tends to stray further from 
the target point, and both bias and variance are incurred. By p = 10, for 
more than 99% of the samples the nearest neighbor is a distance greater 
than 0.5 from the origin. Thus as p increases, the estimate tends to be 0 
more often than not, and hence the MSE levels off at 1.0, as does the bias, 
and the variance starts dropping (an artifact of this example). 
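
The behavior described above can be reproduced with a small simulation. This is only a sketch under the stated setup (1000 training points uniform on $[-1, 1]^p$, noiseless target $f(x) = e^{-8\|x\|^2}$, test point at the origin); the function name, the number of replications and the seed are arbitrary choices made here for illustration.

```python
import math
import random


def one_nn_at_origin(p, n_train=1000, reps=50, seed=0):
    """Return (mse, variance, squared_bias) of the 1-NN estimate of f(0),
    where f(x) = exp(-8 * ||x||^2) and inputs are uniform on [-1, 1]^p."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Squared distance from the origin to the nearest training point.
        d2 = min(
            sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(p))
            for _ in range(n_train)
        )
        # The 1-NN prediction is the noiseless response at that point.
        estimates.append(math.exp(-8.0 * d2))
    mean = sum(estimates) / reps
    variance = sum((e - mean) ** 2 for e in estimates) / reps
    squared_bias = (mean - 1.0) ** 2  # the target value is f(0) = 1
    mse = sum((e - 1.0) ** 2 for e in estimates) / reps
    return mse, variance, squared_bias


for p in (1, 2, 5, 10):
    mse, var, bias2 = one_nn_at_origin(p)
    print(p, round(mse, 4), round(var, 4), round(bias2, 4))
```

With these settings the MSE should be near zero for $p = 1$ and close to 1 for $p = 10$, with squared bias accounting for nearly all of it, as in the decomposition (2.25).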

Although this is a highly contrived example, similar phenomena occur 
more generally. The complexity of functions of many variables can grow 
exponentially with the dimension, and if we wish to be able to estimate 
such functions with the same accuracy as functions in low dimensions, then
we need the size of our training set to grow exponentially as well. In this 
example, the function is a complex interaction of all p variables involved. 

The dependence of the bias term on distance depends on the truth, and 
it need not always dominate with 1-nearest neighbor. For example, if the 
function always involves only a few dimensions as in Figure 2.8, then the 
variance can dominate instead. 

FIGURE 2.7. A simulation example, demonstrating the curse of dimensionality
and its effect on MSE, bias and variance. The input features are uniformly
distributed in $[-1, 1]^p$ for $p = 1, \ldots, 10$. The top left panel shows the target
function (no noise) in $\mathbb{R}$: $f(X) = e^{-8\|X\|^2}$, and demonstrates the error that 1-nearest
neighbor makes in estimating $f(0)$. The training point is indicated by the blue tick
mark. The top right panel illustrates why the radius of the 1-nearest neighborhood
increases with dimension $p$. The lower left panel shows the average radius of the
1-nearest neighborhoods. The lower-right panel shows the MSE, squared bias and
variance curves as a function of dimension $p$.

FIGURE 2.8. A simulation example with the same setup as in Figure 2.7. Here
the function is constant in all but one dimension: $F(X) = \frac{1}{2}(X_1 + 1)^3$. The
variance dominates. [Panels: "1-NN in One Dimension" (horizontal axis $x$) and
"MSE vs. Dimension" (horizontal axis Dimension).]

Suppose, on the other hand, that we know that the relationship between
$Y$ and $X$ is linear,

$$Y = X^T\beta + \varepsilon, \qquad (2.26)$$

where $\varepsilon \sim N(0, \sigma^2)$ and we fit the model by least squares to the training
data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_0^T\hat{\beta}$, which can
be written as $\hat{y}_0 = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\varepsilon_i$, where $\ell_i(x_0)$ is the $i$th element
of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$. Since under this model the least squares estimates are
unbiased, we find that

$$\begin{aligned}
\mathrm{EPE}(x_0) &= E_{y_0|x_0} E_\mathcal{T}(y_0 - \hat{y}_0)^2 \\
&= \mathrm{Var}(y_0|x_0) + E_\mathcal{T}[\hat{y}_0 - E_\mathcal{T}\hat{y}_0]^2 + [E_\mathcal{T}\hat{y}_0 - x_0^T\beta]^2 \\
&= \mathrm{Var}(y_0|x_0) + \mathrm{Var}_\mathcal{T}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0) \\
&= \sigma^2 + E_\mathcal{T}\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2 + 0^2. \qquad (2.27)
\end{aligned}$$

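Here we have incurred an additional variance $\sigma^2$ in the prediction error,
since our target is not deterministic. There is no bias, and the variance
depends on $x_0$. If $N$ is large and $\mathcal{T}$ were selected at random, and assuming
$E(X) = 0$, then $\mathbf{X}^T\mathbf{X} \to N\,\mathrm{Cov}(X)$ and

$$\begin{aligned}
E_{x_0}\mathrm{EPE}(x_0) &\sim E_{x_0}\, x_0^T \mathrm{Cov}(X)^{-1} x_0\, \sigma^2/N + \sigma^2 \\
&= \mathrm{trace}[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)]\, \sigma^2/N + \sigma^2 \\
&= \sigma^2 (p/N) + \sigma^2. \qquad (2.28)
\end{aligned}$$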


Here we see that the expected EPE increases linearly as a function of p, 
with slope $\sigma^2/N$. If $N$ is large and/or $\sigma^2$ is small, this growth in variance
is negligible (0 in the deterministic case). By imposing some heavy
restrictions on the class of models being fitted, we have avoided the curse 
of dimensionality. Some of the technical details in (2.27) and (2.28) are 
derived in Exercise 2.5. 
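
The approximation in (2.28) can be checked numerically by averaging the exact conditional expression $\sigma^2 + x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2$ from (2.27) over random test points. The sketch below assumes inputs uniform on $[-1, 1]^p$ and no intercept, so that $E(X) = 0$; the helper names `solve` and `average_epe` are hypothetical, and the small Gaussian-elimination routine is included only to keep the example free of external libraries.

```python
import random


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with
    partial pivoting (a is a list of rows, b a list)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def average_epe(p, n_train=500, n_test=200, sigma2=1.0, seed=0):
    """Average sigma^2 + x0^T (X^T X)^{-1} x0 sigma^2 over random x0."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n_train)]
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)]
           for i in range(p)]
    total = 0.0
    for _ in range(n_test):
        x0 = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        z = solve(xtx, x0)  # z = (X^T X)^{-1} x0
        total += sigma2 + sigma2 * sum(u * v for u, v in zip(x0, z))
    return total / n_test


for p in (1, 5, 10):
    print(p, round(average_epe(p), 4), 1.0 + p / 500)
```

The second and third printed columns should agree closely, illustrating the linear growth in $p$ with slope $\sigma^2/N$.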

Figure 2.9 compares 1-nearest neighbor vs. least squares in two situations,
both of which have the form $Y = f(X) + \varepsilon$, $X$ uniform as before,
and $\varepsilon \sim N(0, 1)$. The sample size is $N = 500$. For the orange curve, $f(x)$
is linear in the first coordinate, for the blue curve, cubic as in Figure 2.8.
Shown is the relative EPE of 1-nearest neighbor to least squares, which
appears to start at around 2 for the linear case. Least squares is unbiased
in this case, and as discussed above the EPE is slightly above $\sigma^2 = 1$.
The EPE for 1-nearest neighbor is always above 2, since the variance of
$\hat{f}(x_0)$ in this case is at least $\sigma^2$, and the ratio increases with dimension as
the nearest neighbor strays from the target point. For the cubic case, least
squares is biased, which moderates the ratio. Clearly we could manufacture
examples where the bias of least squares would dominate the variance, and
the 1-nearest neighbor would come out the winner.

FIGURE 2.9. The curves show the expected prediction error (at $x_0 = 0$) for
1-nearest neighbor relative to least squares for the model $Y = f(X) + \varepsilon$. For the
orange curve, $f(x) = x_1$, while for the blue curve $f(x) = \frac{1}{2}(x_1 + 1)^3$.
[Panel title: "Expected Prediction Error of 1NN vs. OLS"; horizontal axis: Dimension.]

By relying on rigid assumptions, the linear model has no bias at all and 
negligible variance, while the error in 1-nearest neighbor is substantially
larger. However, if the assumptions are wrong, all bets are off and the
1-nearest neighbor may dominate. We will see that there is a whole spectrum
of models between the rigid linear models and the extremely flexible
1-nearest-neighbor models, each with their own assumptions and biases, 
which have been proposed specifically to avoid the exponential growth in 
complexity of functions in high dimensions by drawing heavily on these 
assumptions. 

Before we delve more deeply, let us elaborate a bit on the concept of 
statistical models and see how they fit into the prediction framework.


2.6 Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation $\hat{f}(x)$ to the function $f(x)$ that
underlies the predictive relationship between the inputs and outputs. In the
theoretical setting of Section 2.4, we saw that squared error loss led us
to the regression function $f(x) = E(Y|X = x)$ for a quantitative response.
The class of nearest-neighbor methods can be viewed as direct estimates 
of this conditional expectation, but we have seen that they can fail in at 
least two ways: 

• if the dimension of the input space is high, the nearest neighbors need 
not be close to the target point, and can result in large errors; 

• if special structure is known to exist, this can be used to reduce both 
the bias and the variance of the estimates. 

We anticipate using other classes of models for $f(x)$, in many cases specifically
designed to overcome the dimensionality problems, and here we discuss
a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y) 

Suppose in fact that our data arose from a statistical model 

$$Y = f(X) + \varepsilon, \qquad (2.29)$$

where the random error $\varepsilon$ has $E(\varepsilon) = 0$ and is independent of $X$. Note that
for this model, $f(x) = E(Y|X = x)$, and in fact the conditional distribution
$\Pr(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.

The additive error model is a useful approximation to the truth. For 
most systems the input-output pairs (X,Y) will not have a deterministic 
relationship Y = f(X). Generally there will be other unmeasured variables 
that also contribute to Y, including measurement error. The additive model 
assumes that we can capture all these departures from a deterministic
relationship via the error $\varepsilon$.

For some problems a deterministic relationship does hold. Many of the 
classification problems studied in machine learning are of this form, where 
the response surface can be thought of as a colored map defined in $\mathbb{R}^p$.
The training data consist of colored examples from the map $\{x_i, g_i\}$, and
the goal is to be able to color any point. Here the function is deterministic, 
and the randomness enters through the x location of the training points. 
For the moment we will not pursue such problems, but will see that they 
can be handled by techniques appropriate for the error-based models. 

The assumption in (2.29) that the errors are independent and identically 
distributed is not strictly necessary, but seems to be at the back of our mind
when we average squared errors uniformly in our EPE criterion. With such
a model it becomes natural to use least squares as a data criterion for
model estimation as in (2.1). Simple modifications can be made to avoid
the independence assumption; for example, we can have $\mathrm{Var}(Y|X = x) = \sigma(x)$,
and now both the mean and variance depend on $X$. In general the
conditional distribution $\Pr(Y|X)$ can depend on $X$ in complicated ways,
but the additive error model precludes these. 

So far we have concentrated on the quantitative response. Additive error 
models are typically not used for qualitative outputs $G$; in this case the target
function $p(X)$ is the conditional density $\Pr(G|X)$, and this is modeled
directly. For example, for two-class data, it is often reasonable to assume
that the data arise from independent binary trials, with the probability of
one particular outcome being $p(X)$, and the other $1 - p(X)$. Thus if $Y$ is
the 0-1 coded version of $G$, then $E(Y|X = x) = p(x)$, but the variance
depends on $x$ as well: $\mathrm{Var}(Y|X = x) = p(x)[1 - p(x)]$.


2.6.2 Supervised Learning 

Before we launch into more statistically oriented jargon, we present the 
function-fitting paradigm from a machine learning point of view. Suppose 
for simplicity that the errors are additive and that the model $Y = f(X) + \varepsilon$
is a reasonable assumption. Supervised learning attempts to learn $f$ by
example through a teacher. One observes the system under study, both
the inputs and outputs, and assembles a training set of observations
$\mathcal{T} = (x_i, y_i)$, $i = 1, \ldots, N$. The observed input values to the system $x_i$ are also
fed into an artificial system, known as a learning algorithm (usually a computer
program), which also produces outputs $\hat{f}(x_i)$ in response to the inputs.
The learning algorithm has the property that it can modify its input/output
relationship $\hat{f}$ in response to differences $y_i - \hat{f}(x_i)$ between the
original and generated outputs. This process is known as learning by example.
Upon completion of the learning process the hope is that the artificial
and real outputs will be close enough to be useful for all sets of inputs likely 
to be encountered in practice. 


2.6.3 Function Approximation 

The learning paradigm of the previous section has been the motivation 
for research into the supervised learning problem in the fields of machine 
learning (with analogies to human reasoning) and neural networks (with 
biological analogies to the brain). The approach taken in applied mathematics
and statistics has been from the perspective of function approximation
and estimation. Here the data pairs $\{x_i, y_i\}$ are viewed as points in a
$(p + 1)$-dimensional Euclidean space. The function $f(x)$ has domain equal
to the $p$-dimensional input subspace, and is related to the data via a model
such as $y_i = f(x_i) + \varepsilon_i$. For convenience in this chapter we will assume the
domain is $\mathbb{R}^p$, a $p$-dimensional Euclidean space, although in general the
inputs can be of mixed type. The goal is to obtain a useful approximation
to $f(x)$ for all $x$ in some region of $\mathbb{R}^p$, given the representations in $\mathcal{T}$.
Although somewhat less glamorous than the learning paradigm, treating 
supervised learning as a problem in function approximation encourages the 
geometrical concepts of Euclidean spaces and mathematical concepts of 
probabilistic inference to be applied to the problem. This is the approach 
taken in this book. 

Many of the approximations we will encounter have associated a set of
parameters $\theta$ that can be modified to suit the data at hand. For example,
the linear model $f(x) = x^T\beta$ has $\theta = \beta$. Another class of useful
approximators can be expressed as linear basis expansions

$$f_\theta(x) = \sum_{k=1}^{K} h_k(x)\theta_k, \qquad (2.30)$$

where the $h_k$ are a suitable set of functions or transformations of the input
vector $x$. Traditional examples are polynomial and trigonometric expansions,
where for example $h_k$ might be $x_1^2$, $x_1 x_2^2$, $\cos(x_1)$ and so on. We
also encounter nonlinear expansions, such as the sigmoid transformation
common to neural network models,

$$h_k(x) = \frac{1}{1 + \exp(-x^T\beta_k)}. \qquad (2.31)$$


We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did
for the linear model, by minimizing the residual sum-of-squares

$$\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2 \qquad (2.32)$$

as a function of $\theta$. This seems a reasonable criterion for an additive error
model. In terms of function approximation, we imagine our parameterized
function as a surface in $p + 1$ space, and what we observe are noisy
realizations from it. This is easy to visualize when $p = 2$ and the vertical
coordinate is the output $y$, as in Figure 2.10. The noise is in the output
coordinate, so we find the set of parameters such that the fitted surface
gets as close to the observed points as possible, where close is measured by
the sum of squared vertical errors in $\mathrm{RSS}(\theta)$.

For the linear model we get a simple closed form solution to the minimization
problem. This is also true for the basis function methods, if the
basis functions themselves do not have any hidden parameters. Otherwise 
the solution requires either iterative methods or numerical optimization. 
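
As a small illustration of the closed-form case, the sketch below fits a two-term basis expansion $f_\theta(x) = \theta_1 h_1(x) + \theta_2 h_2(x)$ by solving the $2 \times 2$ normal equations directly. The function name, the choice of basis ($h_1 = 1$, $h_2 = \cos$) and the noiseless data are all assumptions made here for illustration.

```python
import math


def fit_two_basis(xs, ys, h1, h2):
    """Minimize RSS(theta) = sum_i (y_i - theta1*h1(x_i) - theta2*h2(x_i))^2
    by solving the 2x2 normal equations with Cramer's rule."""
    a11 = sum(h1(x) * h1(x) for x in xs)
    a12 = sum(h1(x) * h2(x) for x in xs)
    a22 = sum(h2(x) * h2(x) for x in xs)
    b1 = sum(h1(x) * y for x, y in zip(xs, ys))
    b2 = sum(h2(x) * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    theta1 = (b1 * a22 - a12 * b2) / det
    theta2 = (a11 * b2 - a12 * b1) / det
    return theta1, theta2


# Noiseless data generated from theta = (2.0, -0.5) with basis (1, cos x).
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 - 0.5 * math.cos(x) for x in xs]
print(fit_two_basis(xs, ys, lambda x: 1.0, math.cos))
```

Because the basis functions carry no hidden parameters, the fit is linear in $\theta$ and no iteration is needed; with noiseless data the generating coefficients are recovered up to rounding.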

FIGURE 2.10. Least squares fitting of a function of two inputs. The parameters
of $f_\theta(x)$ are chosen so as to minimize the sum-of-squared vertical errors.

While least squares is generally very convenient, it is not the only criterion
used and in some cases would not make much sense. A more general
principle for estimation is maximum likelihood estimation. Suppose we have
a random sample $y_i$, $i = 1, \ldots, N$ from a density $\Pr_\theta(y)$ indexed by some
parameters $\theta$. The log-probability of the observed sample is

$$L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i). \qquad (2.33)$$

The principle of maximum likelihood assumes that the most reasonable
values for $\theta$ are those for which the probability of the observed sample is
largest. Least squares for the additive error model $Y = f_\theta(X) + \varepsilon$, with
$\varepsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional
likelihood

$$\Pr(Y|X, \theta) = N(f_\theta(X), \sigma^2). \qquad (2.34)$$

So although the additional assumption of normality seems more restrictive,
the results are the same. The log-likelihood of the data is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2, \qquad (2.35)$$

and the only term involving $\theta$ is the last, which is $\mathrm{RSS}(\theta)$ up to a scalar
negative multiplier.
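
The relationship between (2.35) and the residual sum-of-squares can be made concrete with a few lines of code. In this sketch the function name and the example residuals are hypothetical; the residuals stand for the quantities $y_i - f_\theta(x_i)$.

```python
import math


def gaussian_loglik(residuals, sigma):
    """Log-likelihood (2.35) given residuals y_i - f_theta(x_i) and a
    fixed error standard deviation sigma."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - n * math.log(sigma)
            - rss / (2.0 * sigma ** 2))


# Two candidate fits to the same data: the one with smaller RSS has the
# larger log-likelihood, and the gap is (RSS_b - RSS_a) / (2 sigma^2).
res_a = [0.1, -0.2, 0.05]
res_b = [0.5, -1.0, 0.25]
print(gaussian_loglik(res_a, 1.0) - gaussian_loglik(res_b, 1.0))
```

Since the first two terms do not involve $\theta$, ranking candidate parameter values by log-likelihood is the same as ranking them by RSS, whatever fixed value of $\sigma$ is used.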

A more interesting example is the multinomial likelihood for the regression
function $\Pr(G|X)$ for a qualitative output $G$. Suppose we have a model
$\Pr(G = \mathcal{G}_k|X = x) = p_{k,\theta}(x)$, $k = 1, \ldots, K$ for the conditional probability
of each class given $X$, indexed by the parameter vector $\theta$. Then the
log-likelihood (also referred to as the cross-entropy) is

$$L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i), \qquad (2.36)$$

and when maximized it delivers values of $\theta$ that best conform with the data
in this likelihood sense.
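
Evaluating (2.36) requires only the fitted class probabilities at each observation. A minimal sketch follows, in which `class_probs[i][k]` plays the role of $p_{k,\theta}(x_i)$ and classes are coded as integer indices; both conventions, and the example numbers, are assumptions made for illustration.

```python
import math


def multinomial_loglik(class_probs, labels):
    """L(theta) = sum_i log p_{g_i, theta}(x_i), where class_probs[i] holds
    the model's class probabilities at x_i and labels[i] is the index g_i."""
    return sum(math.log(probs[g]) for probs, g in zip(class_probs, labels))


# Hypothetical fitted probabilities for three observations and two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
print(multinomial_loglik(probs, labels))
```

Only the probability assigned to the observed class enters each term, so a model is rewarded for placing high probability on what actually occurred.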


2.7 Structured Regression Models 

We have seen that although nearest-neighbor and other local methods focus 
directly on estimating the function at a point, they face problems in high 
dimensions. They may also be inappropriate even in low dimensions in 
cases where more structured approaches can make more efficient use of the 
data. This section introduces classes of such structured approaches. Before 
we proceed, though, we discuss further the need for such classes. 

2.7.1 Difficulty of the Problem 

Consider the RSS criterion for an arbitrary function $f$,

$$\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2. \qquad (2.37)$$

Minimizing (2.37) leads to infinitely many solutions: any function $\hat{f}$ passing
through the training points $(x_i, y_i)$ is a solution. Any particular solution
chosen might be a poor predictor at test points different from the training
points. If there are multiple observation pairs $x_i, y_{i\ell}$, $\ell = 1, \ldots, N_i$ at each
value of $x_i$, the risk is limited. In this case, the solutions pass through
the average values of the $y_{i\ell}$ at each $x_i$; see Exercise 2.6. The situation is
similar to the one we have already visited in Section 2.4; indeed, (2.37) is 
the finite sample version of (2.11) on page 18. If the sample size N were 
sufficiently large such that repeats were guaranteed and densely arranged, 
it would seem that these solutions might all tend to the limiting conditional 
expectation. 

In order to obtain useful results for finite N, we must restrict the eligible 
solutions to (2.37) to a smaller set of functions. How to decide on the 
nature of the restrictions is based on considerations outside of the data. 
These restrictions are sometimes encoded via the parametric representation 
of $f_\theta$, or may be built into the learning method itself, either implicitly or
explicitly. These restricted classes of solutions are the major topic of this
book. One thing should be clear, though. Any restrictions imposed on $f$
that lead to a unique solution to (2.37) do not really remove the ambiguity
caused by the multiplicity of solutions. There are infinitely many possible
restrictions, each leading to a unique solution, so the ambiguity has simply 
been transferred to the choice of constraint. 

In general the constraints imposed by most learning methods can be 
described as complexity restrictions of one kind or another. This usually 
means some kind of regular behavior in small neighborhoods of the input 
space. That is, for all input points x sufficiently close to each other in 
some metric, $\hat{f}$ exhibits some special structure such as nearly constant,
linear or low-order polynomial behavior. The estimator is then obtained by 
averaging or polynomial fitting in that neighborhood. 

The strength of the constraint is dictated by the neighborhood size. The 
larger the size of the neighborhood, the stronger the constraint, and the 
more sensitive the solution is to the particular choice of constraint. For 
example, local constant fits in infinitesimally small neighborhoods are no
constraint at all; local linear fits in very large neighborhoods are almost a
globally linear model, and are very restrictive.

The nature of the constraint depends on the metric used. Some methods, 
such as kernel and local regression and tree-based methods, directly specify 
the metric and size of the neighborhood. The nearest-neighbor methods 
discussed so far are based on the assumption that locally the function is 
constant; close to a target input $x_0$, the function does not change much, and
so close outputs can be averaged to produce $\hat{f}(x_0)$. Other methods such
as splines, neural networks and basis-function methods implicitly define 
neighborhoods of local behavior. In Section 5.4.1 we discuss the concept 
of an equivalent kernel (see Figure 5.8 on page 157), which describes this 
local dependence for any method linear in the outputs. These equivalent 
kernels in many cases look just like the explicitly defined weighting kernels 
discussed above: peaked at the target point and falling smoothly
away from it.

One fact should be clear by now. Any method that attempts to produce
locally varying functions in small isotropic neighborhoods will run
into problems in high dimensions; this is again the curse of dimensionality. And
conversely, all methods that overcome the dimensionality problems have an
associated (and often implicit or adaptive) metric for measuring neighborhoods,
which basically does not allow the neighborhood to be simultaneously
small in all directions.


2.8 Classes of Restricted Estimators 

The variety of nonparametric regression techniques or learning methods fall 
into a number of different classes depending on the nature of the restrictions 
imposed. These classes are not distinct, and indeed some methods fall in 
several classes. Here we give a brief summary, since detailed descriptions
are given in later chapters. Each of the classes has associated with it one
or more parameters, sometimes appropriately called smoothing parameters, 
that control the effective size of the local neighborhood. Here we describe 
three broad classes. 

2.8.1 Roughness Penalty and Bayesian Methods 

Here the class of functions is controlled by explicitly penalizing $\mathrm{RSS}(f)$
with a roughness penalty

$$\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f). \qquad (2.38)$$

The user-selected functional $J(f)$ will be large for functions $f$ that vary too
rapidly over small regions of input space. For example, the popular cubic
smoothing spline for one-dimensional inputs is the solution to the penalized
least-squares criterion

$$\mathrm{PRSS}(f; \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\, dx. \qquad (2.39)$$

The roughness penalty here controls large values of the second derivative
of $f$, and the amount of penalty is dictated by $\lambda \ge 0$. For $\lambda = 0$ no penalty
is imposed, and any interpolating function will do, while for $\lambda = \infty$ only
functions linear in $x$ are permitted.

Penalty functionals $J$ can be constructed for functions in any dimension,
and special versions can be created to impose special structure. For example,
additive penalties $J(f) = \sum_{j=1}^{p} J(f_j)$ are used in conjunction with
additive functions $f(X) = \sum_{j=1}^{p} f_j(X_j)$ to create additive models with
smooth coordinate functions. Similarly, projection pursuit regression models
have $f(X) = \sum_{m=1}^{M} g_m(\alpha_m^T X)$ for adaptively chosen directions $\alpha_m$, and
the functions $g_m$ can each have an associated roughness penalty.

Penalty function, or regularization methods, express our prior belief that
the type of functions we seek exhibit a certain type of smooth behavior, and
indeed can usually be cast in a Bayesian framework. The penalty $J$ corresponds
to a log-prior, and $\mathrm{PRSS}(f; \lambda)$ the log-posterior distribution, and
minimizing $\mathrm{PRSS}(f; \lambda)$ amounts to finding the posterior mode. We discuss
roughness-penalty approaches in Chapter 5 and the Bayesian paradigm in
Chapter 8.
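
To get a feel for the roughness penalty $J(f) = \int [f''(x)]^2\,dx$, the sketch below approximates it with central second differences on a grid and combines it with the RSS into the penalized criterion. This finite-difference approximation is an illustrative substitute chosen here; it is not how smoothing splines are actually computed, and the function names are hypothetical.

```python
def roughness(f, a=0.0, b=1.0, n=1000):
    """Approximate J(f) = integral over [a, b] of f''(x)^2 dx, using
    central second differences at the n - 1 interior grid points."""
    h = (b - a) / n
    total = 0.0
    for i in range(1, n):
        x = a + i * h
        second = (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h)
        total += second * second * h
    return total


def prss(f, xs, ys, lam):
    """Penalized residual sum-of-squares: RSS(f) + lam * J(f), with J
    approximated over the range of the inputs."""
    rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    return rss + lam * roughness(f, min(xs), max(xs))


print(roughness(lambda x: 2.0 * x + 1.0))  # linear: essentially zero
print(roughness(lambda x: x * x))          # f'' = 2 on [0, 1]: close to 4
```

A linear function incurs essentially no penalty, which is why only functions linear in $x$ survive as $\lambda \to \infty$, while any curvature is charged in proportion to $\lambda$.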


2.8.2 Kernel Methods and Local Regression 

These methods can be thought of as explicitly providing estimates of the
regression function or conditional expectation by specifying the nature of the
local neighborhood, and of the class of regular functions fitted locally. The
local neighborhood is specified by a kernel function $K_\lambda(x_0, x)$ which assigns
weights to points $x$ in a region around $x_0$ (see Figure 6.1 on page 192). For
example, the Gaussian kernel has a weight function based on the Gaussian
density function

$$K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \qquad (2.40)$$

and assigns weights to points that die exponentially with their squared
Euclidean distance from $x_0$. The parameter $\lambda$ corresponds to the variance
of the Gaussian density, and controls the width of the neighborhood. The
simplest form of kernel estimate is the Nadaraya-Watson weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}. \qquad (2.41)$$

In general we can define a local regression estimate of $f(x_0)$ as $f_{\hat{\theta}}(x_0)$,
where $\hat{\theta}$ minimizes

$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)(y_i - f_\theta(x_i))^2, \qquad (2.42)$$

and $f_\theta$ is some parameterized function, such as a low-order polynomial.
Some examples are:

• $f_\theta(x) = \theta_0$, the constant function; this results in the Nadaraya-Watson
estimate in (2.41) above.

• $f_\theta(x) = \theta_0 + \theta_1 x$ gives the popular local linear regression model.
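
The two cases listed above can be sketched for scalar inputs as follows. The Gaussian kernel follows (2.40); the function names and the example data are assumptions made for illustration, and the local linear fit uses the standard closed form for weighted simple regression.

```python
import math


def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) = (1/lambda) * exp(-(x - x0)^2 / (2*lambda))."""
    return math.exp(-((x - x0) ** 2) / (2.0 * lam)) / lam


def nadaraya_watson(x0, xs, ys, lam):
    """Local constant fit: the kernel-weighted average (2.41)."""
    w = [gaussian_kernel(x0, x, lam) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def local_linear(x0, xs, ys, lam):
    """Minimize sum_i K(x0, x_i) (y_i - theta0 - theta1*x_i)^2 as in (2.42)
    and return the fitted value theta0 + theta1*x0."""
    w = [gaussian_kernel(x0, x, lam) for x in xs]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, xs, ys))
    theta1 = sxy / sxx
    return ybar + theta1 * (x0 - xbar)


# Noiseless linear data on [0, 1]; evaluate both fits at the left boundary.
xs = [i / 10 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
print(nadaraya_watson(0.0, xs, ys, 0.05), local_linear(0.0, xs, ys, 0.05))
```

On this example the local linear fit reproduces the true value 1 at the boundary, while the local constant average is pulled upward because all neighbors lie to one side of the target point.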

Nearest-neighbor methods can be thought of as kernel methods having a
more data-dependent metric. Indeed, the metric for $k$-nearest neighbors is

$$K_k(x, x_0) = I(\|x - x_0\| \le \|x_{(k)} - x_0\|),$$

where $x_{(k)}$ is the training observation ranked $k$th in distance from $x_0$, and
$I(S)$ is the indicator of the set $S$.

These methods of course need to be modified in high dimensions, to avoid 
the curse of dimensionality. Various adaptations are discussed in Chapter 6. 

2.8.3 Basis Functions and Dictionary Methods 

This class of methods includes the familiar linear and polynomial expansions,
but more importantly a wide variety of more flexible models. The
model for $f$ is a linear expansion of basis functions

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x), \qquad (2.43)$$

where each of the $h_m$ is a function of the input $x$, and the term linear here
refers to the action of the parameters $\theta$. This class covers a wide variety of
methods. In some cases the sequence of basis functions is prescribed, such 
as a basis for polynomials in x of total degree M. 

For one-dimensional $x$, polynomial splines of degree $K$ can be represented
by an appropriate sequence of $M$ spline basis functions, determined in turn
by $M - K - 1$ knots. These produce functions that are piecewise polynomials
of degree $K$ between the knots, and joined up with continuity of degree
$K - 1$ at the knots. As an example consider linear splines, or piecewise
linear functions. One intuitively satisfying basis consists of the functions
$b_1(x) = 1$, $b_2(x) = x$, and $b_{m+2}(x) = (x - t_m)_+$, $m = 1, \ldots, M - 2$,
where $t_m$ is the $m$th knot, and $z_+$ denotes positive part. Tensor products
of spline bases can be used for inputs with dimensions larger than one
(see Section 5.2, and the CART and MARS models in Chapter 9). The
parameter $\theta$ can be the total degree of the polynomial or the number of
knots in the case of splines.
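
The linear spline basis just described is simple to write down. In this sketch the function names are hypothetical and the knots and coefficients are arbitrary illustrative values.

```python
def linear_spline_basis(x, knots):
    """Evaluate b_1(x) = 1, b_2(x) = x and b_{m+2}(x) = (x - t_m)_+ for
    each knot t_m, returning M = len(knots) + 2 values."""
    return [1.0, x] + [max(x - t, 0.0) for t in knots]


def spline_value(x, knots, theta):
    """Linear expansion f_theta(x) = sum_m theta_m * b_m(x)."""
    return sum(t * b for t, b in zip(theta, linear_spline_basis(x, knots)))


knots = [1.0, 3.0]
print(linear_spline_basis(2.0, knots))
print(spline_value(2.0, knots, [0.5, 1.0, -2.0, 4.0]))
```

Each knot adds one basis function that is zero to its left and linear to its right, so the fitted function can change slope at the knot while remaining continuous there.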

Radial basis functions are symmetric $p$-dimensional kernels located at
particular centroids,

$$f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\,\theta_m; \qquad (2.44)$$

for example, the Gaussian kernel $K_\lambda(\mu, x) = e^{-\|x - \mu\|^2/2\lambda}$ is popular.

Radial basis functions have centroids $\mu_m$ and scales $\lambda_m$ that have to
be determined. The spline basis functions have knots. In general we would
like the data to dictate them as well. Including these as parameters changes
the regression problem from a straightforward linear problem to a combinatorially
hard nonlinear problem. In practice, shortcuts such as greedy
algorithms or two stage processes are used. Section 6.7 describes some such 
approaches. 

A single-layer feed-forward neural network model with linear output
weights can be thought of as an adaptive basis function method. The model
has the form

$$f_\theta(x) = \sum_{m=1}^{M} \beta_m \sigma(\alpha_m^T x + b_m), \qquad (2.45)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is known as the activation function. Here, as
in the projection pursuit model, the directions $\alpha_m$ and the bias terms $b_m$
have to be determined, and their estimation is the meat of the computation.
Details are given in Chapter 11.
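
Evaluating the model in (2.45) for given parameters is straightforward; the hard part, as noted, is estimating them. The sketch below only shows the forward evaluation, with hypothetical function names and arbitrary illustrative weights.

```python
import math


def sigmoid(v):
    """Activation function sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def single_layer_net(x, alphas, biases, betas):
    """f(x) = sum_m beta_m * sigmoid(alpha_m^T x + b_m), where alphas is a
    list of direction vectors and biases, betas are lists of scalars."""
    return sum(
        beta * sigmoid(sum(a * xi for a, xi in zip(alpha, x)) + b)
        for alpha, b, beta in zip(alphas, biases, betas)
    )


x = [1.0, 2.0]
alphas = [[0.5, -0.25], [1.0, 1.0]]
biases = [0.0, -3.0]
betas = [1.0, 3.0]
print(single_layer_net(x, alphas, biases, betas))
```

With the directions and biases held fixed, the model is linear in the output weights $\beta_m$, which is what makes it an instance of a basis expansion with adaptively chosen basis functions.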

These adaptively chosen basis function methods are also known as dictionary
methods, where one has available a possibly infinite set or dictionary
$\mathcal{D}$ of candidate basis functions from which to choose, and models are built
up by employing some kind of search mechanism.


2.9 Model Selection and the Bias-Variance Tradeoff


All the models described above and many others discussed in later chapters 
have a smoothing or complexity parameter that has to be determined: 

• the multiplier of the penalty term; 

• the width of the kernel; 

• or the number of basis functions. 

In the case of the smoothing spline, the parameter $\lambda$ indexes models ranging
from a straight line fit to the interpolating model. Similarly a local degree-$m$
polynomial model ranges between a degree-$m$ global polynomial when
the window size is infinitely large, to an interpolating fit when the window 
size shrinks to zero. This means that we cannot use residual sum-of-squares 
on the training data to determine these parameters as well, since we would 
always pick those that gave interpolating fits and hence zero residuals. Such 
a model is unlikely to predict future data well at all. 

The $k$-nearest-neighbor regression fit $\hat{f}_k(x_0)$ usefully illustrates the
competing forces that affect the predictive ability of such approximations.
Suppose the data arise from a model $Y = f(X) + \varepsilon$, with $E(\varepsilon) = 0$ and
$\mathrm{Var}(\varepsilon) = \sigma^2$. For simplicity here we assume that the values of $x_i$ in the
sample are fixed in advance (nonrandom). The expected prediction error
at $x_0$, also known as test or generalization error, can be decomposed:

$$\begin{aligned}
\mathrm{EPE}_k(x_0) &= E[(Y - \hat{f}_k(x_0))^2 | X = x_0] \\
&= \sigma^2 + [\mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_\mathcal{T}(\hat{f}_k(x_0))] \qquad (2.46) \\
&= \sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\right]^2 + \frac{\sigma^2}{k}. \qquad (2.47)
\end{aligned}$$

The subscripts in parentheses $(\ell)$ indicate the sequence of nearest neighbors
to $x_0$.

There are three terms in this expression. The first term $\sigma^2$ is the
irreducible error (the variance of the new test target) and is beyond our
control, even if we know the true $f(x_0)$.

The second and third terms are under our control, and make up the
mean squared error of $\hat{f}_k(x_0)$ in estimating $f(x_0)$, which is broken down
into a bias component and a variance component. The bias term is the
squared difference between the true mean $f(x_0)$ and the expected value of
the estimate, $[E_{\mathcal{T}}(\hat{f}_k(x_0)) - f(x_0)]^2$, where the expectation averages the
randomness in the training data. This term will most likely increase with
k, if the true function is reasonably smooth. For small k the few closest
neighbors will have values $f(x_{(\ell)})$ close to $f(x_0)$, so their average should
be close to $f(x_0)$. As k grows, the neighbors are further away, and then
anything can happen.

FIGURE 2.11. Test and training error as a function of model complexity.

The variance term is simply the variance of an average here, and decreases
as the inverse of k. So as k varies, there is a bias-variance tradeoff.

More generally, as the model complexity of our procedure is increased, the
variance tends to increase and the squared bias tends to decrease. The
opposite behavior occurs as the model complexity is decreased. For k-nearest
neighbors, the model complexity is controlled by k.
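
The decomposition (2.47) can be evaluated directly when the inputs are fixed. The sketch below is only an illustration under assumptions of our own choosing: a hypothetical regression function f(x) = x**2, noise variance 1, and inputs on an evenly spaced grid. The function name `knn_error_terms` is likewise made up for this example.

```python
# Sketch of the three terms in (2.47) for k-nearest-neighbor regression.
# Assumptions not taken from the text: f(x) = x**2, sigma^2 = 1, and
# inputs fixed on an evenly spaced grid.

def f(x):
    return x ** 2  # hypothetical smooth regression function


def knn_error_terms(xs, x0, k, sigma2):
    """Return (irreducible error, squared bias, variance) of the k-NN fit at x0."""
    # The k training inputs closest to x0 (the x_(l) of the text).
    neighbors = sorted(xs, key=lambda x: abs(x - x0))[:k]
    mean_f = sum(f(x) for x in neighbors) / k
    squared_bias = (f(x0) - mean_f) ** 2
    variance = sigma2 / k  # variance of an average of k uncorrelated errors
    return sigma2, squared_bias, variance


xs = [i / 50 for i in range(51)]  # fixed (nonrandom) inputs on [0, 1]
sigma2 = 1.0

for k in (1, 5, 15, 31):
    irreducible, bias2, var = knn_error_terms(xs, 0.5, k, sigma2)
    print(k, bias2, var, irreducible + bias2 + var)
```

In this particular setup the squared bias grows with k while the variance falls as 1/k, which is the tradeoff described above; other choices of f or of the grid could behave differently.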

Typically we would like to choose our model complexity to trade bias
off with variance in such a way as to minimize the test error. An obvious
estimate of test error is the training error $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$.
Unfortunately training error is not a good estimate of test error, as it does
not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as 
model complexity is varied. The training error tends to decrease whenever 
we increase the model complexity, that is, whenever we fit the data harder. 
However with too much fitting, the model adapts itself too closely to the 
training data, and will not generalize well (i.e., have large test error). In 
that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the
last term of expression (2.46). In contrast, if the model is not complex 
enough, it will underfit and may have large bias, again resulting in poor 
generalization. In Chapter 7 we discuss methods for estimating the test 
error of a prediction method, and hence estimating the optimal amount of 
model complexity for a given prediction method and training set.

Exercises


Ex. 2.1 Suppose each of K classes has an associated target $t_k$, which is a
vector of all zeros, except a one in the kth position. Show that classifying to
the largest element of $\hat{y}$ amounts to choosing the closest target,
$\min_k \|t_k - \hat{y}\|$, if the elements of $\hat{y}$ sum to one.

Ex. 2.2 Show how to compute the Bayes decision boundary for the simulation
example in Figure 2.5.

Ex. 2.3 Derive equation (2.24). 

Ex. 2.4 The edge effect problem discussed on page 23 is not peculiar to
uniform sampling from bounded domains. Consider inputs drawn from a
spherical multinormal distribution $X \sim N(0, I_p)$. The squared distance
from any sample point to the origin has a $\chi^2_p$ distribution with mean p.
Consider a prediction point $x_0$ drawn from this distribution, and let
$a = x_0/\|x_0\|$ be an associated unit vector. Let $z_i = a^T x_i$ be the
projection of each of the training points on this direction.

Show that the $z_i$ are distributed $N(0, 1)$ with expected squared distance
from the origin 1, while the target point has expected squared distance p
from the origin.

Hence for p = 10, a randomly drawn test point is about 3.1 standard 
deviations from the origin, while all the training points are on average 
one standard deviation along direction a. So most prediction points see 
themselves as lying on the edge of the training set. 

Ex. 2.5 

(a) Derive equation (2.27). The last line makes use of (3.8) through a 
conditioning argument. 

(b) Derive equation (2.28), making use of the cyclic property of the trace
operator [trace(AB) = trace(BA)], and its linearity (which allows us
to interchange the order of trace and expectation).

Ex. 2.6 Consider a regression problem with inputs $x_i$ and outputs $y_i$, and a
parameterized model $f_\theta(x)$ to be fit by least squares. Show that if there are
observations with tied or identical values of x, then the fit can be obtained
from a reduced weighted least squares problem.

Ex. 2.7 Suppose we have a sample of N pairs $x_i, y_i$ drawn i.i.d. from the
distribution characterized as follows:

$x_i \sim h(x)$, the design density

$y_i = f(x_i) + \varepsilon_i$, f is the regression function

$\varepsilon_i \sim (0, \sigma^2)$ (mean zero, variance $\sigma^2$)

We construct an estimator for f linear in the $y_i$,

$$\hat{f}(x_0) = \sum_{i=1}^{N} \ell_i(x_0; \mathcal{X})\, y_i,$$

where the weights $\ell_i(x_0; \mathcal{X})$ do not depend on the $y_i$, but do depend on the
entire training sequence of $x_i$, denoted here by $\mathcal{X}$.

(a) Show that linear regression and k-nearest-neighbor regression are
members of this class of estimators. Describe explicitly the weights
$\ell_i(x_0; \mathcal{X})$ in each of these cases.

(b) Decompose the conditional mean-squared error

$$E_{\mathcal{Y}|\mathcal{X}}(f(x_0) - \hat{f}(x_0))^2$$

into a conditional squared bias and a conditional variance component.
Like $\mathcal{X}$, $\mathcal{Y}$ represents the entire training sequence of $y_i$.

(c) Decompose the (unconditional) mean-squared error

$$E_{\mathcal{Y},\mathcal{X}}(f(x_0) - \hat{f}(x_0))^2$$

into a squared bias and a variance component.

(d) Establish a relationship between the squared biases and variances in
the above two cases.

Ex. 2.8 Compare the classification performance of linear regression and
k-nearest neighbor classification on the zipcode data. In particular, consider
only the 2’s and 3’s, and k = 1, 3, 5, 7 and 15. Show both the training and
test error for each choice. The zipcode data are available from the book
website www-stat.stanford.edu/ElemStatLearn.

Ex. 2.9 Consider a linear regression model with p parameters, fit by least
squares to a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ drawn at random
from a population. Let $\hat\beta$ be the least squares estimate. Suppose we have
some test data $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_M, \tilde{y}_M)$ drawn at random from the same
population as the training data. If $R_{tr}(\beta) = \frac{1}{N}\sum_{1}^{N}(y_i - \beta^T x_i)^2$ and
$R_{te}(\beta) = \frac{1}{M}\sum_{1}^{M}(\tilde{y}_i - \beta^T \tilde{x}_i)^2$, prove that

$$E[R_{tr}(\hat\beta)] \le E[R_{te}(\hat\beta)],$$

where the expectations are over all that is random in each expression. [This
exercise was brought to our attention by Ryan Tibshirani, from a homework
assignment given by Andrew Ng.]

3

Linear Methods for Regression


3.1 Introduction 

A linear regression model assumes that the regression function $E(Y|X)$ is
linear in the inputs $X_1, \ldots, X_p$. Linear models were largely developed in
the precomputer age of statistics, but even in today’s computer era there 
are still good reasons to study and use them. They are simple and often 
provide an adequate and interpretable description of how the inputs affect 
the output. For prediction purposes they can sometimes outperform fancier 
nonlinear models, especially in situations with small numbers of training 
cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be 
applied to transformations of the inputs and this considerably expands their 
scope. These generalizations are sometimes called basis-function methods, 
and are discussed in Chapter 5. 

In this chapter we describe linear methods for regression, while in the 
next chapter we discuss linear methods for classification. On some topics we 
go into considerable detail, as it is our firm belief that an understanding 
of linear methods is essential for understanding nonlinear ones. In fact, 
many nonlinear techniques are direct generalizations of the linear methods 
discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector $X^T = (X_1, X_2, \ldots, X_p)$,
and want to predict a real-valued output Y. The linear regression model
has the form

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j. \qquad (3.1)$$

The linear model either assumes that the regression function $E(Y|X)$ is
linear, or that the linear model is a reasonable approximation. Here the
$\beta_j$'s are unknown parameters or coefficients, and the variables $X_j$ can come
from different sources:

• quantitative inputs; 

• transformations of quantitative inputs, such as log, square-root or 
square; 

• basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$, leading to a polynomial
representation;

• numeric or “dummy” coding of the levels of qualitative inputs. For
example, if G is a five-level factor input, we might create $X_j$, $j = 1, \ldots, 5$,
such that $X_j = I(G = j)$. Together this group of $X_j$ represents the effect
of G by a set of level-dependent constants, since in $\sum_{j=1}^{5} X_j \beta_j$, one of
the $X_j$s is one, and the others are zero.

• interactions between variables, for example, $X_3 = X_1 \cdot X_2$.

No matter the source of the $X_j$, the model is linear in the parameters.

Typically we have a set of training data $(x_1, y_1) \ldots (x_N, y_N)$ from which
to estimate the parameters $\beta$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector
of feature measurements for the ith case. The most popular estimation
method is least squares, in which we pick the coefficients
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ to minimize the residual sum of squares

$$
\begin{aligned}
\mathrm{RSS}(\beta) &= \sum_{i=1}^{N} (y_i - f(x_i))^2 \\
&= \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2. \qquad (3.2)
\end{aligned}
$$

From a statistical point of view, this criterion is reasonable if the training
observations $(x_i, y_i)$ represent independent random draws from their
population. Even if the $x_i$'s were not drawn randomly, the criterion is still valid
if the $y_i$'s are conditionally independent given the inputs $x_i$. Figure 3.1
illustrates the geometry of least-squares fitting in the $\mathbb{R}^{p+1}$-dimensional
space occupied by the pairs (X, Y). Note that (3.2) makes no assumptions
about the validity of model (3.1); it simply finds the best linear fit to the
data. Least squares fitting is intuitively satisfying no matter how the data
arise; the criterion measures the average lack of fit.

FIGURE 3.1. Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear
function of X that minimizes the sum of squared residuals from Y.

How do we minimize (3.2)? Denote by X the $N \times (p + 1)$ matrix with
each row an input vector (with a 1 in the first position), and similarly let
y be the N-vector of outputs in the training set. Then we can write the
residual sum-of-squares as

$$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta). \qquad (3.3)$$


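This is a quadratic function in the p + 1 parameters. Differentiating with
respect to $\beta$ we obtain

$$
\begin{aligned}
\frac{\partial \mathrm{RSS}}{\partial \beta} &= -2X^T(y - X\beta) \\
\frac{\partial^2 \mathrm{RSS}}{\partial \beta\, \partial \beta^T} &= 2X^T X. \qquad (3.4)
\end{aligned}
$$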


Assuming (for the moment) that X has full column rank, and hence $X^T X$
is positive definite, we set the first derivative to zero

$$X^T(y - X\beta) = 0 \qquad (3.5)$$

to obtain the unique solution

$$\hat\beta = (X^T X)^{-1} X^T y. \qquad (3.6)$$

FIGURE 3.2. The N-dimensional geometry of least squares regression with two
predictors. The outcome vector y is orthogonally projected onto the hyperplane
spanned by the input vectors $x_1$ and $x_2$. The projection $\hat{y}$ represents the vector
of the least squares predictions.

The predicted values at an input vector $x_0$ are given by
$\hat{f}(x_0) = (1 : x_0)^T \hat\beta$; the fitted values at the training inputs are

$$\hat{y} = X\hat\beta = X(X^T X)^{-1} X^T y, \qquad (3.7)$$

where $\hat{y}_i = \hat{f}(x_i)$. The matrix $H = X(X^T X)^{-1} X^T$ appearing in equation
(3.7) is sometimes called the “hat” matrix because it puts the hat on y.
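
As a rough illustration of (3.5) and (3.6), the following sketch fits a straight line to five made-up points by solving the normal equations with Gaussian elimination. The data and the helper names (`transpose`, `matmul`, `solve`) are assumptions introduced for this example only; it is not meant as a numerically robust implementation.

```python
# Sketch: least squares through the normal equations X^T X beta = X^T y,
# solved by Gaussian elimination. Data and helper names are illustrative.

def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b for a square, nonsingular a, using partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
X = [[1.0, v] for v in inputs]  # the leading 1 carries the intercept

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(u * v for u, v in zip(col, y)) for col in Xt]
beta_hat = solve(XtX, Xty)  # the solution (3.6), without forming an inverse

y_hat = [sum(b * v for b, v in zip(beta_hat, row)) for row in X]
residual = [yi - fi for yi, fi in zip(y, y_hat)]
# Equation (3.5): the residual is orthogonal to every column of X.
orthogonality = [sum(u * r for u, r in zip(col, residual)) for col in Xt]
print(beta_hat, orthogonality)
```

For these data the fit has intercept 1.4 and slope 0.8, and the entries of `orthogonality` vanish up to rounding error.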

Figure 3.2 shows a different geometrical representation of the least squares
estimate, this time in $\mathbb{R}^N$. We denote the column vectors of X by
$x_0, x_1, \ldots, x_p$, with $x_0 = 1$. For much of what follows, this first column is
treated like any other. These vectors span a subspace of $\mathbb{R}^N$, also referred to
as the column space of X. We minimize $\mathrm{RSS}(\beta) = \|y - X\beta\|^2$ by choosing
$\hat\beta$ so that the residual vector $y - \hat{y}$ is orthogonal to this subspace. This
orthogonality is expressed in (3.5), and the resulting estimate $\hat{y}$ is hence the
orthogonal projection of y onto this subspace. The hat matrix H computes the
orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of X are not linearly independent, so
that X is not of full rank. This would occur, for example, if two of the
inputs were perfectly correlated (e.g., $x_2 = 3x_1$). Then $X^T X$ is singular
and the least squares coefficients $\hat\beta$ are not uniquely defined. However,
the fitted values $\hat{y} = X\hat\beta$ are still the projection of y onto the column
space of X; there is just more than one way to express that projection
in terms of the column vectors of X. The non-full-rank case occurs most
often when one or more qualitative inputs are coded in a redundant fashion.
There is usually a natural way to resolve the non-unique representation,
by recoding and/or dropping redundant columns in X. Most regression
software packages detect these redundancies and automatically implement
some strategy for removing them. Rank deficiencies can also occur in signal
and image analysis, where the number of inputs p can exceed the number
of training cases N. In this case, the features are typically reduced by
filtering or else the fitting is controlled by regularization (Section 5.2.3 and
Chapter 18).

Up to now we have made minimal assumptions about the true distribution
of the data. In order to pin down the sampling properties of $\hat\beta$, we now
assume that the observations $y_i$ are uncorrelated and have constant variance
$\sigma^2$, and that the $x_i$ are fixed (nonrandom). The variance-covariance
matrix of the least squares parameter estimates is easily derived from (3.6)
and is given by

$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1} \sigma^2. \qquad (3.8)$$

Typically one estimates the variance $\sigma^2$ by

$$\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$$

The $N - p - 1$ rather than N in the denominator makes $\hat\sigma^2$ an unbiased
estimate of $\sigma^2$: $E(\hat\sigma^2) = \sigma^2$.

To draw inferences about the parameters and the model, additional
assumptions are needed. We now assume that (3.1) is the correct model for
the mean; that is, the conditional expectation of Y is linear in $X_1, \ldots, X_p$.
We also assume that the deviations of Y around its expectation are additive
and Gaussian. Hence

$$
\begin{aligned}
Y &= E(Y \mid X_1, \ldots, X_p) + \varepsilon \\
&= \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad (3.9)
\end{aligned}
$$

where the error $\varepsilon$ is a Gaussian random variable with expectation zero and
variance $\sigma^2$, written $\varepsilon \sim N(0, \sigma^2)$.

Under (3.9), it is easy to show that

$$\hat\beta \sim N(\beta, (X^T X)^{-1} \sigma^2). \qquad (3.10)$$

This is a multivariate normal distribution with mean vector and
variance-covariance matrix as shown. Also

$$(N - p - 1)\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1}, \qquad (3.11)$$

a chi-squared distribution with $N - p - 1$ degrees of freedom. In addition $\hat\beta$
and $\hat\sigma^2$ are statistically independent. We use these distributional properties
to form tests of hypothesis and confidence intervals for the parameters $\beta_j$.

FIGURE 3.3. The tail probabilities $\Pr(|Z| > z)$ for three distributions, $t_{30}$, $t_{100}$
and standard normal. Shown are the appropriate quantiles for testing significance
at the p = 0.05 and 0.01 levels. The difference between t and the standard normal
becomes negligible for N bigger than about 100.


To test the hypothesis that a particular coefficient $\beta_j = 0$, we form the
standardized coefficient or Z-score

$$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}}, \qquad (3.12)$$

where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$. Under the null hypothesis
that $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$ (a t distribution with $N - p - 1$
degrees of freedom), and hence a large (absolute) value of $z_j$ will lead to
rejection of this null hypothesis. If $\hat\sigma$ is replaced by a known value $\sigma$, then
$z_j$ would have a standard normal distribution. The difference between the
tail quantiles of a t-distribution and a standard normal becomes negligible
as the sample size increases, and so we typically use the normal quantiles
(see Figure 3.3).
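
To make the ingredients of (3.12) concrete, here is a small sketch for a straight-line fit, where $(X^T X)^{-1}$ is a 2 x 2 matrix with a closed-form inverse. The five data points are invented for illustration. With so few observations the normal approximation to the t distribution would be poor, so the example is meant to show the arithmetic only.

```python
import math

# Sketch: Z-scores (3.12) for a straight-line fit, using the closed-form
# inverse of the 2 x 2 matrix X^T X. The data are made up for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
N, p = len(x), 1

# Entries of X^T X and X^T y for the design matrix with rows (1, x_i).
s1, sx, sxx = float(N), sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = s1 * sxx - sx * sx

# Diagonal of (X^T X)^{-1}: v_0 for the intercept, v_1 for the slope.
v = [sxx / det, s1 / det]
beta_hat = [(sxx * sy - sx * sxy) / det, (s1 * sxy - sx * sy) / det]

rss = sum((yi - beta_hat[0] - beta_hat[1] * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (N - p - 1)  # unbiased estimate of sigma^2
z = [b / math.sqrt(sigma2_hat * vj) for b, vj in zip(beta_hat, v)]
print(beta_hat, sigma2_hat, z)
```

Each Z-score is the coefficient divided by its estimated standard error, the square root of $\hat\sigma^2 v_j$.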

Often we need to test for the significance of groups of coefficients
simultaneously. For example, to test if a categorical variable with k levels can
be excluded from a model, we need to test whether the coefficients of the
dummy variables used to represent the levels can all be set to zero. Here
we use the F statistic,

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}, \qquad (3.13)$$


where $\mathrm{RSS}_1$ is the residual sum-of-squares for the least squares fit of the
bigger model with $p_1 + 1$ parameters, and $\mathrm{RSS}_0$ the same for the nested smaller
model with $p_0 + 1$ parameters, having $p_1 - p_0$ parameters constrained to be
zero. The F statistic measures the change in residual sum-of-squares per
additional parameter in the bigger model, and it is normalized by an
estimate of $\sigma^2$. Under the Gaussian assumptions, and the null hypothesis that
the smaller model is correct, the F statistic will have an $F_{p_1-p_0,\,N-p_1-1}$
distribution. It can be shown (Exercise 3.1) that the $z_j$ in (3.12) are equivalent
to the F statistic for dropping the single coefficient $\beta_j$ from the model. For
large N, the quantiles of $F_{p_1-p_0,\,N-p_1-1}$ approach those of $\chi^2_{p_1-p_0}/(p_1 - p_0)$.

Similarly, we can isolate $\beta_j$ in (3.10) to obtain a $1 - 2\alpha$ confidence interval
for $\beta_j$:

$$(\hat\beta_j - z^{(1-\alpha)} v_j^{1/2} \hat\sigma,\ \hat\beta_j + z^{(1-\alpha)} v_j^{1/2} \hat\sigma). \qquad (3.14)$$

Here $z^{(1-\alpha)}$ is the $1 - \alpha$ percentile of the normal distribution:

$$z^{(1-0.025)} = 1.96, \qquad z^{(1-0.05)} = 1.645, \text{ etc.}$$

Hence the standard practice of reporting $\hat\beta \pm 2 \cdot \mathrm{se}(\hat\beta)$ amounts to an
approximate 95% confidence interval. Even if the Gaussian error assumption
does not hold, this interval will be approximately correct, with its coverage
approaching $1 - 2\alpha$ as the sample size $N \to \infty$.

In a similar fashion we can obtain an approximate confidence set for the
entire parameter vector $\beta$, namely

$$C_\beta = \{\beta \mid (\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le \hat\sigma^2 {\chi^2_{p+1}}^{(1-\alpha)}\}, \qquad (3.15)$$

where ${\chi^2_\ell}^{(1-\alpha)}$ is the $1 - \alpha$ percentile of the chi-squared distribution on $\ell$
degrees of freedom: for example, ${\chi^2_5}^{(1-0.05)} = 11.1$, ${\chi^2_5}^{(1-0.1)} = 9.2$. This
confidence set for $\beta$ generates a corresponding confidence set for the true
function $f(x) = x^T \beta$, namely $\{x^T \beta \mid \beta \in C_\beta\}$ (Exercise 3.2; see also
Figure 5.4 in Section 5.2.2 for examples of confidence bands for functions).


3.2.1 Example: Prostate Cancer 

The data for this example come from a study by Stamey et al. (1989). They 
examined the correlation between the level of prostate-specific antigen and 
a number of clinical measures in men who were about to receive a radical 
prostatectomy. The variables are log cancer volume (lcavol), log prostate 
weight (lweight), age, log of the amount of benign prostatic hyperplasia 
(lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), 
Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). 
The correlation matrix of the predictors given in Table 3.1 shows many 
strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix 
showing every pairwise plot between the variables. We see that svi is a 
binary variable, and gleason is an ordered categorical variable. We see, for
example, that both lcavol and lcp show a strong relationship with the
response lpsa, and with each other. We need to fit the effects jointly to
untangle the relationships between the predictors and the response.

TABLE 3.1. Correlations of predictors in the prostate cancer data.

           lcavol  lweight     age    lbph     svi     lcp  gleason
lweight     0.300
age         0.286    0.317
lbph        0.063    0.437   0.287
svi         0.593    0.181   0.129  -0.139
lcp         0.692    0.157   0.173  -0.089   0.671
gleason     0.426    0.024   0.366   0.033   0.307   0.476
pgg45       0.483    0.074   0.276  -0.030   0.481   0.663    0.757

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the
coefficient divided by its standard error (3.12). Roughly a Z score larger than two
in absolute value is significantly nonzero at the p = 0.05 level.

Term        Coefficient  Std. Error  Z Score
Intercept          2.46        0.09    27.60
lcavol             0.68        0.13     5.37
lweight            0.26        0.10     2.75
age               -0.14        0.10    -1.40
lbph               0.21        0.10     2.06
svi                0.31        0.12     2.47
lcp               -0.29        0.15    -1.87
gleason           -0.02        0.15    -0.15
pgg45              0.27        0.15     1.74

We fit a linear model to the log of prostate-specific antigen, lpsa, after
first standardizing the predictors to have unit variance. We randomly split
the dataset into a training set of size 67 and a test set of size 30. We
applied least squares estimation to the training set, producing the estimates,
standard errors and Z-scores shown in Table 3.2. The Z-scores are defined
in (3.12), and measure the effect of dropping that variable from the model.
A Z-score greater than 2 in absolute value is approximately significant at
the 5% level. (For our example, we have nine parameters, and the 0.025 tail
quantiles of the $t_{67-9}$ distribution are ±2.002!) The predictor lcavol shows
the strongest effect, with lweight and svi also strong. Notice that lcp is
not significant, once lcavol is in the model (when used in a model without
lcavol, lcp is strongly significant). We can also test for the exclusion of
a number of terms at once, using the F-statistic (3.13). For example, we
consider dropping all the non-significant terms in Table 3.2, namely age,
lcp, gleason, and pgg45. We get

$$F = \frac{(32.81 - 29.43)/(9 - 5)}{29.43/(67 - 9)} = 1.67, \qquad (3.16)$$

which has a p-value of 0.17 ($\Pr(F_{4,58} > 1.67) = 0.17$), and hence is not
significant.
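
The arithmetic in (3.16) is easy to check. The sketch below uses only the residual sums of squares and parameter counts quoted in the text; the variable names are our own. The p-value is not computed, since that would need an F distribution function that the Python standard library does not provide.

```python
# Sketch: the arithmetic behind (3.16). The residual sums of squares and
# parameter counts are the values quoted in the text.
rss_small, rss_big = 32.81, 29.43  # RSS_0 (5 parameters), RSS_1 (9 parameters)
n_train = 67
params_small, params_big = 5, 9    # counts include the intercept

numerator = (rss_small - rss_big) / (params_big - params_small)
denominator = rss_big / (n_train - params_big)
f_stat = numerator / denominator
print(round(f_stat, 2))
```

Rounded to two decimals this reproduces the value 1.67 reported in (3.16), on 4 and 58 degrees of freedom.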

The mean prediction error on the test data is 0.521. In contrast, prediction
using the mean training value of lpsa has a test error of 1.057, which
is called the “base error rate.” Hence the linear model reduces the base
error rate by about 50%. We will return to this example later to compare
various selection and shrinkage methods.


3.2.2 The Gauss-Markov Theorem 

One of the most famous results in statistics asserts that the least squares
estimates of the parameters $\beta$ have the smallest variance among all linear
unbiased estimates. We will make this precise here, and also make clear
that the restriction to unbiased estimates is not necessarily a wise one. This
observation will lead us to consider biased estimates such as ridge regression
later in the chapter. We focus on estimation of any linear combination of
the parameters $\theta = a^T \beta$; for example, predictions $f(x_0) = x_0^T \beta$ are of this
form. The least squares estimate of $a^T \beta$ is

$$\hat\theta = a^T \hat\beta = a^T (X^T X)^{-1} X^T y. \qquad (3.17)$$

Considering X to be fixed, this is a linear function $c_0^T y$ of the response
vector y. If we assume that the linear model is correct, $a^T \hat\beta$ is unbiased
since

$$
\begin{aligned}
E(a^T \hat\beta) &= E(a^T (X^T X)^{-1} X^T y) \\
&= a^T (X^T X)^{-1} X^T X \beta \\
&= a^T \beta. \qquad (3.18)
\end{aligned}
$$

The Gauss-Markov theorem states that if we have any other linear estimator
$\tilde\theta = c^T y$ that is unbiased for $a^T \beta$, that is, $E(c^T y) = a^T \beta$, then

$$\mathrm{Var}(a^T \hat\beta) \le \mathrm{Var}(c^T y). \qquad (3.19)$$

The proof (Exercise 3.3) uses the triangle inequality. For simplicity we have
stated the result in terms of estimation of a single parameter $a^T \beta$, but with
a few more definitions one can state it in terms of the entire parameter
vector $\beta$ (Exercise 3.3).

Consider the mean squared error of an estimator $\tilde\theta$ in estimating $\theta$:

$$
\begin{aligned}
\mathrm{MSE}(\tilde\theta) &= E(\tilde\theta - \theta)^2 \\
&= \mathrm{Var}(\tilde\theta) + [E(\tilde\theta) - \theta]^2. \qquad (3.20)
\end{aligned}
$$

The first term is the variance, while the second term is the squared bias.
The Gauss-Markov theorem implies that the least squares estimator has the 
smallest mean squared error of all linear estimators with no bias. However, 
there may well exist a biased estimator with smaller mean squared error. 
Such an estimator would trade a little bias for a larger reduction in variance. 
Biased estimates are commonly used. Any method that shrinks or sets to 
zero some of the least squares coefficients may result in a biased estimate. 
We discuss many examples, including variable subset selection and ridge 
regression, later in this chapter. From a more pragmatic point of view, most 
models are distortions of the truth, and hence are biased; picking the right 
model amounts to creating the right balance between bias and variance. 
We go into these issues in more detail in Chapter 7. 
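
A toy case, not taken from the text, shows how a biased estimator can have smaller mean squared error. Suppose we estimate a mean $\theta$ from n observations with variance $\sigma^2$ by the shrunken average $c\,\bar{y}$, where c = 1 gives the unbiased sample mean. The numbers below are hypothetical, and the function name is made up for this sketch.

```python
# Sketch: MSE = variance + squared bias, as in (3.20), for the estimator
# c * ybar of a mean theta, where ybar averages n observations with variance
# sigma2. All numbers are hypothetical; c = 1 is the unbiased sample mean.

def mse_of_shrunken_mean(c, theta, sigma2, n):
    variance = c ** 2 * sigma2 / n
    squared_bias = ((c - 1.0) * theta) ** 2
    return variance + squared_bias


theta, sigma2, n = 1.0, 4.0, 4
for c in (1.0, 0.75, 0.5, 0.25):
    print(c, mse_of_shrunken_mean(c, theta, sigma2, n))
```

With these values, shrinking by c = 0.5 halves the mean squared error relative to the unbiased estimate. The best c depends on the unknown $\theta$, so this only illustrates the tradeoff and is not a usable procedure in itself.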

Mean squared error is intimately related to prediction accuracy, as
discussed in Chapter 2. Consider the prediction of the new response at input
$x_0$,

$$Y_0 = f(x_0) + \varepsilon_0. \qquad (3.21)$$

Then the expected prediction error of an estimate $\tilde{f}(x_0) = x_0^T \tilde\beta$ is

$$
\begin{aligned}
E(Y_0 - \tilde{f}(x_0))^2 &= \sigma^2 + E(x_0^T \tilde\beta - f(x_0))^2 \\
&= \sigma^2 + \mathrm{MSE}(\tilde{f}(x_0)). \qquad (3.22)
\end{aligned}
$$

Therefore, expected prediction error and mean squared error differ only by
the constant $\sigma^2$, representing the variance of the new observation $y_0$.


3.2.3 Multiple Regression from Simple Univariate Regression 

The linear model (3.1) with p > 1 inputs is called the multiple linear 
regression model. The least squares estimates (3.6) for this model are best 
understood in terms of the estimates for the univariate (p = 1) linear
model, as we indicate in this section. 

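Suppose first that we have a univariate model with no intercept, that is,

$$Y = X\beta + \varepsilon. \qquad (3.23)$$

The least squares estimate and residuals are

$$
\begin{aligned}
\hat\beta &= \frac{\sum_{1}^{N} x_i y_i}{\sum_{1}^{N} x_i^2}, \\
r_i &= y_i - x_i \hat\beta. \qquad (3.24)
\end{aligned}
$$

In convenient vector notation, we let $y = (y_1, \ldots, y_N)^T$, $x = (x_1, \ldots, x_N)^T$
and define

$$
\begin{aligned}
\langle x, y \rangle &= \sum_{i=1}^{N} x_i y_i \\
&= x^T y, \qquad (3.25)
\end{aligned}
$$

the inner product between x and y¹. Then we can write

$$
\begin{aligned}
\hat\beta &= \frac{\langle x, y \rangle}{\langle x, x \rangle}, \\
r &= y - x\hat\beta. \qquad (3.26)
\end{aligned}
$$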


As we will see, this simple univariate regression provides the building block
for multiple linear regression. Suppose next that the inputs $x_1, x_2, \ldots, x_p$
(the columns of the data matrix X) are orthogonal; that is $\langle x_j, x_k \rangle = 0$
for all $j \ne k$. Then it is easy to check that the multiple least squares
estimates $\hat\beta_j$ are equal to $\langle x_j, y \rangle / \langle x_j, x_j \rangle$, the univariate estimates. In other
words, when the inputs are orthogonal, they have no effect on each other’s
parameter estimates in the model.
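
The claim about orthogonal inputs can be checked numerically. In this sketch the two input columns and the response are invented, there is no intercept, and the joint fit is obtained by solving the 2 x 2 normal equations with Cramer's rule.

```python
# Sketch: with orthogonal columns, multiple regression coefficients coincide
# with the univariate ones. The columns and y are made up for illustration.

def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]  # inner(x1, x2) == 0
y = [3.0, 1.0, 2.0, 0.0]

# Univariate (no-intercept) estimates, one input at a time.
univariate = [inner(x1, y) / inner(x1, x1), inner(x2, y) / inner(x2, x2)]

# Joint fit: solve the 2 x 2 normal equations by Cramer's rule.
a11, a12, a22 = inner(x1, x1), inner(x1, x2), inner(x2, x2)
b1, b2 = inner(x1, y), inner(x2, y)
det = a11 * a22 - a12 * a12
multiple = [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
print(univariate, multiple)
```

Both lists come out as [1.0, 0.5]. If `x2` were changed so that `inner(x1, x2)` is nonzero, the two sets of estimates would generally differ.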

Orthogonal inputs occur most often with balanced, designed experiments 
(where orthogonality is enforced), but almost never with observational 
data. Hence we will have to orthogonalize them in order to carry this idea 
further. Suppose next that we have an intercept and a single input x. Then 
the least squares coefficient of x has the form 


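$$\hat\beta_1 = \frac{\langle x - \bar{x}\mathbf{1}, y \rangle}{\langle x - \bar{x}\mathbf{1}, x - \bar{x}\mathbf{1} \rangle}, \qquad (3.27)$$

where $\bar{x} = \sum_i x_i / N$, and $\mathbf{1} = x_0$, the vector of N ones. We can view the
estimate (3.27) as the result of two applications of the simple regression
(3.26). The steps are:

1. regress x on $\mathbf{1}$ to produce the residual $z = x - \bar{x}\mathbf{1}$;

2. regress y on the residual z to give the coefficient $\hat\beta_1$.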

In this procedure, “regress b on a” means a simple univariate regression of b
on a with no intercept, producing coefficient $\hat\gamma = \langle a, b \rangle / \langle a, a \rangle$ and residual
vector $b - \hat\gamma a$. We say that b is adjusted for a, or is “orthogonalized” with
respect to a.
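
The two-step recipe can be written out directly. The following sketch uses a hypothetical helper `regress(b, a)` that performs the no-intercept regression just described, applied to five made-up data points.

```python
# Sketch: the slope of a regression with intercept, obtained from two
# no-intercept regressions. Data are made up for illustration.

def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def regress(b, a):
    """Regress b on a with no intercept; return (coefficient, residual vector)."""
    gamma = inner(a, b) / inner(a, a)
    return gamma, [bi - gamma * ai for ai, bi in zip(a, b)]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
ones = [1.0] * len(x)

x_bar, z = regress(x, ones)  # step 1: the coefficient is the mean of x
beta1, _ = regress(y, z)     # step 2: slope of y on the adjusted input z
print(x_bar, z, beta1)
```

Step 1 returns the mean of x as its coefficient and the centered input as its residual; step 2 then gives the slope 0.8, which matches the direct formula (3.27) for these data.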

Step 1 orthogonalizes x with respect to $x_0 = 1$. Step 2 is just a simple
univariate regression, using the orthogonal predictors 1 and z. Figure 3.4
shows this process for two general inputs $x_1$ and $x_2$. The orthogonalization
does not change the subspace spanned by $x_1$ and $x_2$; it simply produces an
orthogonal basis for representing it.

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1.
Note that the inputs $z_0, \ldots, z_{j-1}$ in step 2 are orthogonal, hence the simple
regression coefficients computed there are in fact also the multiple regression
coefficients.


1 The inner-product notation is suggestive of generalizations of linear regression to
different metric spaces, as well as to probability spaces.

FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The
vector $x_2$ is regressed on the vector $x_1$, leaving the residual vector z. The
regression of y on z gives the multiple regression coefficient of $x_2$. Adding together the
projections of y on each of $x_1$ and z gives the least squares fit $\hat{y}$.


Algorithm 3.1 Regression by Successive Orthogonalization. 

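1. Initialize $z_0 = x_0 = 1$.

2. For $j = 1, 2, \ldots, p$

Regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce coefficients
$\hat\gamma_{\ell j} = \langle z_\ell, x_j \rangle / \langle z_\ell, z_\ell \rangle$, $\ell = 0, \ldots, j - 1$ and residual vector
$z_j = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k$.

3. Regress y on the residual $z_p$ to give the estimate $\hat\beta_p$.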


The result of this algorithm is

$$\hat\beta_p = \frac{\langle z_p, y \rangle}{\langle z_p, z_p \rangle}. \qquad (3.28)$$
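
A minimal sketch of Algorithm 3.1 follows. It is meant only to mirror the three steps; the function name, the data, and the noise-free response are assumptions made for the example, and no attention is paid to numerical stability.

```python
# Sketch of Algorithm 3.1 (successive orthogonalization) on made-up data.

def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def successive_orthogonalization(columns, y):
    """columns[0] must be the vector of ones; returns (z vectors, beta_p)."""
    zs = [columns[0][:]]                           # step 1: z_0 = x_0 = 1
    for xj in columns[1:]:                         # step 2: j = 1, ..., p
        zj = xj[:]
        for zl in zs:
            gamma = inner(zl, xj) / inner(zl, zl)  # coefficient of x_j on z_l
            zj = [v - gamma * w for v, w in zip(zj, zl)]
        zs.append(zj)
    beta_p = inner(zs[-1], y) / inner(zs[-1], zs[-1])  # step 3, as in (3.28)
    return zs, beta_p


x0 = [1.0, 1.0, 1.0, 1.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
# y is built as 1 + 2*x1 + 3*x2 with no noise, so the last coefficient is 3.
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

zs, beta_p = successive_orthogonalization([x0, x1, x2], y)
print(beta_p)
```

Because the response here is an exact linear combination of the inputs, the algorithm recovers the coefficient of the last input (3, up to rounding). Reordering the columns so that another input comes last would give that input's coefficient instead.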


Re-arranging the residual in step 2, we can see that each of the $x_j$ is a linear
combination of the $z_k$, $k \le j$. Since the $z_j$ are all orthogonal, they form
a basis for the column space of X, and hence the least squares projection
onto this subspace is $\hat{y}$. Since $z_p$ alone involves $x_p$ (with coefficient 1), we
see that the coefficient (3.28) is indeed the multiple regression coefficient of
y on $x_p$. This key result exposes the effect of correlated inputs in multiple
regression. Note also that by rearranging the $x_j$, any one of them could
be in the last position, and a similar result holds. Hence stated more
generally, we have shown that the jth multiple regression coefficient is the
univariate regression coefficient of y on $x_{j \cdot 012 \ldots (j-1)(j+1) \ldots, p}$, the residual
after regressing $x_j$ on $x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$:

The multiple regression coefficient $\hat\beta_j$ represents the additional
contribution of $x_j$ on y, after $x_j$ has been adjusted for
$x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$.

If $x_p$ is highly correlated with some of the other $x_k$'s, the residual vector
$z_p$ will be close to zero, and from (3.28) the coefficient $\hat\beta_p$ will be very
unstable. This will be true for all the variables in the correlated set. In
such situations, we might have all the Z-scores (as in Table 3.2) be small,
so that any one of the set can be deleted, yet we cannot delete them all. From
(3.28) we also obtain an alternate formula for the variance estimates (3.8),

$$\mathrm{Var}(\hat\beta_p) = \frac{\sigma^2}{\langle z_p, z_p \rangle} = \frac{\sigma^2}{\|z_p\|^2}. \qquad (3.29)$$

In other words, the precision with which we can estimate $\hat\beta_p$ depends on
the length of the residual vector $z_p$; this represents how much of $x_p$ is
unexplained by the other $x_k$'s.
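
Formula (3.29) makes the instability easy to see numerically. In the hypothetical construction below, the last input is `x1` plus a small multiple `eps` of a vector orthogonal to the other columns, so it becomes nearly collinear with `x1` as `eps` shrinks. All names and numbers are assumptions for this sketch.

```python
# Sketch of (3.29): Var(beta_p) = sigma^2 / ||z_p||^2 grows as x_p becomes
# more correlated with another input. The construction is hypothetical.

def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def residual_norm_sq(x_p, others):
    """||z_p||^2 after adjusting x_p for mutually orthogonal vectors in others."""
    z = x_p[:]
    for o in others:
        gamma = inner(o, x_p) / inner(o, o)
        z = [v - gamma * w for v, w in zip(z, o)]
    return inner(z, z)


ones = [1.0, 1.0, 1.0, 1.0]
x1 = [-1.5, -0.5, 0.5, 1.5]     # centered, hence orthogonal to ones
noise = [1.0, -1.0, -1.0, 1.0]  # orthogonal to both ones and x1
sigma2 = 1.0

for eps in (1.0, 0.1, 0.01):
    x2 = [a + eps * b for a, b in zip(x1, noise)]  # x2 approaches x1 as eps -> 0
    var_beta = sigma2 / residual_norm_sq(x2, [ones, x1])
    print(eps, var_beta)
```

Here the residual has squared length 4 * eps**2, so the variance of the last coefficient grows like 1 / (4 * eps**2) as the two inputs become more alike.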

Algorithm 3.1 is known as the Gram-Schmidt procedure for multiple
regression, and is also a useful numerical strategy for computing the
estimates. We can obtain from it not just $\hat\beta_p$, but also the entire multiple least
squares fit, as shown in Exercise 3.4.

We can represent step 2 of Algorithm 3.1 in matrix form:

$$X = Z\Gamma, \qquad (3.30)$$

where Z has as columns the $z_j$ (in order), and $\Gamma$ is the upper triangular
matrix with entries $\hat\gamma_{kj}$. Introducing the diagonal matrix D with jth diagonal
entry $D_{jj} = \|z_j\|$, we get

$$
\begin{aligned}
X &= Z D^{-1} D \Gamma \\
&= QR, \qquad (3.31)
\end{aligned}
$$

the so-called QR decomposition of X. Here Q is an $N \times (p + 1)$ orthogonal
matrix, $Q^T Q = I$, and R is a $(p + 1) \times (p + 1)$ upper triangular matrix.

The QR decomposition represents a convenient orthogonal basis for the
column space of X. It is easy to see, for example, that the least squares
solution is given by

$$\hat\beta = R^{-1} Q^T y, \qquad (3.32)$$

$$\hat{y} = Q Q^T y. \qquad (3.33)$$

Equation (3.32) is easy to solve because R is upper triangular
(Exercise 3.4).
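
To illustrate (3.32), the sketch below builds Q and R by Gram-Schmidt and then solves $R\beta = Q^T y$ by back-substitution, so that R is never inverted explicitly. The helper names and the data are assumptions for this example; production code would normally rely on a library QR routine rather than classical Gram-Schmidt.

```python
import math

# Sketch: QR decomposition by Gram-Schmidt, then back-substitution for
# R beta = Q^T y as in (3.32). Data are made up; names are illustrative.

def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def qr_columns(columns):
    """Return (q, r): orthonormal columns q and upper triangular r with X = QR."""
    n = len(columns)
    q, r = [], [[0.0] * n for _ in range(n)]
    for j, xj in enumerate(columns):
        z = xj[:]
        for k in range(j):
            r[k][j] = inner(q[k], xj)     # equals D_kk * gamma_kj
            z = [v - r[k][j] * w for v, w in zip(z, q[k])]
        r[j][j] = math.sqrt(inner(z, z))  # D_jj = ||z_j||
        q.append([v / r[j][j] for v in z])
    return q, r


def back_substitute(r, b):
    n = len(b)
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (b[i] - sum(r[i][j] * beta[j] for j in range(i + 1, n))) / r[i][i]
    return beta


x0 = [1.0, 1.0, 1.0, 1.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
y = [6.0, 5.0, 10.0, 9.0]  # equals 1 + 2*x1 + 3*x2 exactly

q, r = qr_columns([x0, x1, x2])
qty = [inner(qk, y) for qk in q]    # Q^T y
beta_hat = back_substitute(r, qty)  # solves R beta = Q^T y
print(beta_hat)
```

Since y is an exact linear combination of the columns, the solution is (1, 2, 3) up to rounding error.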

3.2.4 Multiple Outputs

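Suppose we have multiple outputs $Y_1, Y_2, \ldots, Y_K$ that we wish to predict
from our inputs $X_0, X_1, X_2, \ldots, X_p$. We assume a linear model for each
output

$$
\begin{aligned}
Y_k &= \beta_{0k} + \sum_{j=1}^{p} X_j \beta_{jk} + \varepsilon_k \qquad (3.34) \\
&= f_k(X) + \varepsilon_k. \qquad (3.35)
\end{aligned}
$$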

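With $N$ training cases we can write the model in matrix notation

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}. \tag{3.36}$$

Here $\mathbf{Y}$ is the $N \times K$ response matrix, with $ik$ entry $y_{ik}$, $\mathbf{X}$ is the $N \times (p+1)$
input matrix, $\mathbf{B}$ is the $(p+1) \times K$ matrix of parameters and $\mathbf{E}$ is the
$N \times K$ matrix of errors. A straightforward generalization of the univariate
loss function (3.2) is

$$\operatorname{RSS}(\mathbf{B}) = \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2 \tag{3.37}$$
$$= \operatorname{tr}[(\mathbf{Y} - \mathbf{X}\mathbf{B})^T(\mathbf{Y} - \mathbf{X}\mathbf{B})]. \tag{3.38}$$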

The least squares estimates have exactly the same form as before

$$\hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{3.39}$$

Hence the coefficients for the $k$th outcome are just the least squares
estimates in the regression of $\mathbf{y}_k$ on $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_p$. Multiple outputs do not
affect one another's least squares estimates.

If the errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_K)$ in (3.34) are correlated, then it might seem
appropriate to modify (3.37) in favor of a multivariate version. Specifically,
suppose $\operatorname{Cov}(\varepsilon) = \boldsymbol{\Sigma}$, then the multivariate weighted criterion

$$\operatorname{RSS}(\mathbf{B}; \boldsymbol{\Sigma}) = \sum_{i=1}^N (y_i - f(x_i))^T \boldsymbol{\Sigma}^{-1} (y_i - f(x_i)) \tag{3.40}$$

arises naturally from multivariate Gaussian theory. Here $f(x)$ is the vector
function $(f_1(x), \ldots, f_K(x))^T$, and $y_i$ the vector of $K$ responses for
observation $i$. However, it can be shown that again the solution is given by
(3.39): $K$ separate regressions that ignore the correlations (Exercise 3.11).
If the $\boldsymbol{\Sigma}_i$ vary among observations, then this is no longer the case, and the
solution for $\mathbf{B}$ no longer decouples.
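The decoupling can be checked numerically. The sketch below, on simulated data with an arbitrary illustrative covariance `Sigma`, compares the joint estimate (3.39) with $K$ separate regressions, and verifies that the gradient of the weighted criterion (3.40), which works out to $-2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\mathbf{B})\boldsymbol{\Sigma}^{-1}$, vanishes at the same solution. All variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, K = 40, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1)
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + rng.normal(size=(N, K))                     # N x K responses

# Joint estimate (3.39): normal equations solved for all K outputs at once.
B_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# K separate univariate regressions, one per column of Y.
B_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(K)]
)

# RSS written as a trace (3.38) and as a double sum (3.37).
resid = Y - X @ B_joint
rss_trace = np.trace(resid.T @ resid)
rss_sum = np.sum(resid ** 2)

# Gradient of the weighted criterion (3.40) at B_joint, for an arbitrary
# positive definite Sigma chosen only for illustration.
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
grad_weighted = -2.0 * X.T @ resid @ np.linalg.inv(Sigma)
```

Since $\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = 0$ at the least squares solution, the weighted gradient is zero for any fixed $\boldsymbol{\Sigma}$, which is the content of the decoupling statement.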

In Section 3.7 we pursue the multiple outcome problem, and consider 
situations where it does pay to combine the regressions. 
3.3 Subset Selection

There are two reasons why we are often not satisfied with the least squares 
estimates (3.6). 

• The first is prediction accuracy: the least squares estimates often have
low bias but large variance. Prediction accuracy can sometimes be
improved by shrinking or setting some coefficients to zero. By doing
so we sacrifice a little bit of bias to reduce the variance of the predicted
values, and hence may improve the overall prediction accuracy.

• The second reason is interpretation. With a large number of
predictors, we often would like to determine a smaller subset that exhibits
the strongest effects. In order to get the “big picture,” we are willing
to sacrifice some of the small details.

In this section we describe a number of approaches to variable subset
selection with linear regression. In later sections we discuss shrinkage and hybrid
approaches for controlling variance, as well as other dimension-reduction
strategies. These all fall under the general heading model selection. Model
selection is not restricted to linear models; Chapter 7 covers this topic in
some detail.

With subset selection we retain only a subset of the variables, and
eliminate the rest from the model. Least squares regression is used to estimate
the coefficients of the inputs that are retained. There are a number of
different strategies for choosing the subset.

3.3.1 Best-Subset Selection 

Best subset regression finds for each $k \in \{0, 1, 2, \ldots, p\}$ the subset of size $k$
that gives smallest residual sum of squares (3.2). An efficient algorithm— 
the leaps and bounds procedure (Furnival and Wilson, 1974)—makes this 
feasible for p as large as 30 or 40. Figure 3.5 shows all the subset models 
for the prostate cancer example. The lower boundary represents the models 
that are eligible for selection by the best-subsets approach. Note that the 
best subset of size 2, for example, need not include the variable that was 
in the best subset of size 1 (for this example all the subsets are nested). 
The best-subset curve (red lower boundary in Figure 3.5) is necessarily 
decreasing, so cannot be used to select the subset size k. The question of 
how to choose k involves the tradeoff between bias and variance, along with 
the more subjective desire for parsimony. There are a number of criteria 
that one may use; typically we choose the smallest model that minimizes 
an estimate of the expected prediction error. 
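For small $p$ the search can be written as a plain exhaustive enumeration. The sketch below is not the leaps and bounds procedure; it simply fits all $2^p$ models, which is only practical for a handful of predictors. The helper names `rss` and `best_subsets` and the simulated data are assumptions of this example.

```python
import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on intercept + cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    r = y - A @ beta
    return float(r @ r)

def best_subsets(X, y):
    """For each size k = 0..p, return (best column tuple, its RSS) by brute force."""
    p = X.shape[1]
    best = []
    for k in range(p + 1):
        candidates = ((c, rss(X, y, c)) for c in combinations(range(p), k))
        best.append(min(candidates, key=lambda t: t[1]))
    return best

# Simulated data (illustrative only): two of five predictors carry signal.
rng = np.random.default_rng(2)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

best = best_subsets(X, y)
rss_by_size = [r for _, r in best]   # the lower boundary of a plot like Figure 3.5
```

The list `rss_by_size` never increases with $k$, which is why the residual sum of squares alone cannot be used to choose the subset size.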

Many of the other approaches that we discuss in this chapter are similar,
in that they use the training data to produce a sequence of models varying
in complexity and indexed by a single parameter. In the next section we use
cross-validation to estimate prediction error and select $k$; the AIC criterion
is a popular alternative. We defer more detailed discussion of these and
other approaches to Chapter 7.

FIGURE 3.5. All possible subset models for the prostate cancer example. At
each subset size is shown the residual sum-of-squares for each model of that size.


3.3.2 Forward- and Backward-Stepwise Selection 

Rather than search through all possible subsets (which becomes infeasible
for $p$ much larger than 40), we can seek a good path through them. Forward-
stepwise selection starts with the intercept, and then sequentially adds into
the model the predictor that most improves the fit. With many candidate
predictors, this might seem like a lot of computation; however, clever
updating algorithms can exploit the QR decomposition for the current fit to
rapidly establish the next candidate (Exercise 3.9). Like best-subset
regression, forward stepwise produces a sequence of models indexed by $k$, the
subset size, which must be determined.

Forward-stepwise selection is a greedy algorithm, producing a nested
sequence of models. In this sense it might seem sub-optimal compared to
best-subset selection. However, there are several reasons why it might be
preferred:
• Computational; for large $p$ we cannot compute the best subset
sequence, but we can always compute the forward stepwise sequence
(even when $p \gg N$).

• Statistical; a price is paid in variance for selecting the best subset
of each size; forward stepwise is a more constrained search, and will
have lower variance, but perhaps more bias.
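The greedy search itself is short to write down. The following sketch refits each candidate model from scratch instead of using the QR updating mentioned above, so it shows the logic rather than an efficient implementation. The names `fit_rss` and `forward_stepwise` and the simulated data are assumptions of this example.

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of the least squares fit on an intercept plus the columns in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    r = y - A @ beta
    return float(r @ r)

def forward_stepwise(X, y):
    """Return the order in which predictors enter and the RSS after each step."""
    active = []
    remaining = set(range(X.shape[1]))
    path = [fit_rss(X, y, active)]          # intercept-only model
    while remaining:
        # greedy step: add the predictor that gives the largest drop in RSS
        j_best = min(remaining, key=lambda j: fit_rss(X, y, active + [j]))
        active.append(j_best)
        remaining.remove(j_best)
        path.append(fit_rss(X, y, active))
    return active, path

# Simulated data (illustrative only).
rng = np.random.default_rng(2)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

order, rss_path = forward_stepwise(X, y)
```

Each model in the sequence contains the previous one, so the path is nested by construction; best-subset selection offers no such guarantee.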
FIGURE 3.6. Comparison of four subset-selection techniques on a simulated
linear regression problem $Y = X^T\beta + \varepsilon$. There are $N = 300$ observations on $p = 31$
standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of
the variables, the coefficients are drawn at random from a $N(0, 0.4)$ distribution;
the rest are zero. The noise $\varepsilon \sim N(0, 6.25)$, resulting in a signal-to-noise ratio of
0.64. Results are averaged over 50 simulations. Shown is the mean-squared error
of the estimated coefficient $\hat\beta(k)$ at each step from the true $\beta$.

Backward-stepwise selection starts with the full model, and sequentially 
deletes the predictor that has the least impact on the fit. The candidate for 
dropping is the variable with the smallest Z-score (Exercise 3.10). Backward 
selection can only be used when N > p, while forward stepwise can always 
be used. 
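A minimal sketch of the backward pass, assuming $N > p$ and using the usual Z-score $z_j = \hat\beta_j / (\hat\sigma \sqrt{v_j})$, where $v_j$ is the $j$th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$. The function names and the simulated data are assumptions of this example, and the model is refit at every step rather than updated.

```python
import numpy as np

def z_scores(A, y):
    """Z-scores of the least squares coefficients; A includes the intercept column."""
    N, m = A.shape
    AtA_inv = np.linalg.inv(A.T @ A)
    beta = AtA_inv @ A.T @ y
    resid = y - A @ beta
    sigma2 = resid @ resid / (N - m)          # unbiased variance estimate
    return beta / np.sqrt(sigma2 * np.diag(AtA_inv))

def backward_stepwise(X, y):
    """Return predictors in the order they are dropped (intercept always kept)."""
    active = list(range(X.shape[1]))
    dropped = []
    while active:
        A = np.column_stack([np.ones(len(y)), X[:, active]])
        z = z_scores(A, y)[1:]                # skip the intercept
        worst = int(np.argmin(np.abs(z)))     # smallest |Z| has least impact on the fit
        dropped.append(active.pop(worst))
    return dropped

# Simulated data (illustrative only).
rng = np.random.default_rng(2)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=N)

dropped = backward_stepwise(X, y)
```

Reading `dropped` in reverse gives the nested sequence of models from smallest to largest.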

Figure 3.6 shows the results of a small simulation study to compare 
best-subset regression with the simpler alternatives forward and backward 
selection. Their performance is very similar, as is often the case. Included in 
the figure is forward stagewise regression (next section), which takes longer 
to reach minimum error. 
On the prostate cancer example, best-subset, forward and backward
selection all gave exactly the same sequence of terms.

Some software packages implement hybrid stepwise-selection strategies 
that consider both forward and backward moves at each step, and select 
the “best” of the two. For example in the R package the step function uses 
the AIC criterion for weighing the choices, which takes proper account of 
the number of parameters fit; at each step an add or drop will be performed 
that minimizes the AIC score. 

Other more traditional packages base the selection on F-statistics, adding
“significant” terms, and dropping “non-significant” terms. These are out
of fashion, since they do not take proper account of the multiple testing
issues. It is also tempting after a model search to print out a summary of
the chosen model, such as in Table 3.2; however, the standard errors are
not valid, since they do not account for the search process. The bootstrap
(Section 8.2) can be useful in such settings.

Finally, we note that often variables come in groups (such as the dummy
variables that code a multi-level categorical predictor). Smart stepwise
procedures (such as step in R) will add or drop whole groups at a time, taking
proper account of their degrees-of-freedom.


3.3.3 Forward-Stagewise Regression 

Forward-stagewise regression (FS) is even more constrained than forward-
stepwise regression. It starts like forward-stepwise regression, with an
intercept equal to $\bar{y}$, and centered predictors with coefficients initially all 0.
At each step the algorithm identifies the variable most correlated with the
current residual. It then computes the simple linear regression coefficient
of the residual on this chosen variable, and then adds it to the current
coefficient for that variable. This is continued till none of the variables have
correlation with the residuals, i.e., the least-squares fit when $N > p$.

Unlike forward-stepwise regression, none of the other variables are
adjusted when a term is added to the model. As a consequence, forward
stagewise can take many more than $p$ steps to reach the least squares fit,
and historically has been dismissed as being inefficient. It turns out that
this “slow fitting” can pay dividends in high-dimensional problems. We
see in Section 3.8.1 that both forward stagewise and a variant which is
slowed down even further are quite competitive, especially in very
high-dimensional problems.

Forward-stagewise regression is included in Figure 3.6. In this example it
takes over 1000 steps to get all the correlations below $10^{-4}$. For subset size
$k$, we plotted the error for the last step for which there were $k$ nonzero
coefficients. Although it catches up with the best fit, it takes longer to
do so.
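The update rule described above can be sketched as follows. The stopping tolerance `tol`, the cap `max_steps`, the function name and the simulated correlated predictors are all assumptions of this example; the sketch also assumes $N > p$ so that the residual never vanishes.

```python
import numpy as np

def forward_stagewise(X, y, tol=1e-4, max_steps=100000):
    """Forward-stagewise regression on centered predictors.

    Returns the intercept (y-bar), the coefficients, and the number of steps.
    """
    Xc = X - X.mean(axis=0)                  # centered predictors
    intercept = y.mean()
    beta = np.zeros(X.shape[1])
    r = y - intercept                        # current residual
    norms2 = np.sum(Xc ** 2, axis=0)
    for step in range(max_steps):
        ip = Xc.T @ r                        # inner products with the residual
        corr = ip / (np.sqrt(norms2) * np.linalg.norm(r))
        j = int(np.argmax(np.abs(corr)))     # variable most correlated with r
        if abs(corr[j]) < tol:
            return intercept, beta, step
        delta = ip[j] / norms2[j]            # simple regression of r on x_j
        beta[j] += delta                     # only coefficient j is changed
        r = r - delta * Xc[:, j]
    return intercept, beta, max_steps

# Simulated correlated predictors (illustrative only).
rng = np.random.default_rng(4)
N, p = 100, 4
X = rng.normal(size=(N, p)) + 0.8 * rng.normal(size=(N, 1))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=N)

b0_fs, beta_fs, steps = forward_stagewise(X, y)
beta_ls = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
```

Because no other coefficient is adjusted when one is updated, the procedure typically needs many more than $p$ steps before all residual correlations fall below the tolerance, at which point the coefficients are close to the least squares ones.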
3.3.4 Prostate Cancer Data Example (Continued)

Table 3.3 shows the coefficients from a number of different selection and
shrinkage methods. They are best-subset selection using an all-subsets search,
ridge regression, the lasso, principal components regression and partial least
squares. Each method has a complexity parameter, and this was chosen to
minimize an estimate of prediction error based on tenfold cross-validation;
full details are given in Section 7.10. Briefly, cross-validation works by
dividing the training data randomly into ten equal parts. The learning method
is fit (for a range of values of the complexity parameter) to nine-tenths of
the data, and the prediction error is computed on the remaining one-tenth.
This is done in turn for each one-tenth of the data, and the ten prediction
error estimates are averaged. From this we obtain an estimated prediction
error curve as a function of the complexity parameter.

Note that we have already divided these data into a training set of size 
67 and a test set of size 30. Cross-validation is applied to the training set, 
since selecting the shrinkage parameter is part of the training process. The 
test set is there to judge the performance of the selected model. 

The estimated prediction error curves are shown in Figure 3.7. Many of 
the curves are very flat over large ranges near their minimum. Included 
are estimated standard error bands for each estimated error rate, based on 
the ten error estimates computed by cross-validation. We have used the 
“one-standard-error” rule—we pick the most parsimonious model within 
one standard error of the minimum (Section 7.10, page 244). Such a rule 
acknowledges the fact that the tradeoff curve is estimated with error, and 
hence takes a conservative approach. 
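The cross-validation loop and the one-standard-error rule can be sketched as below. Ridge regression (introduced in Section 3.4.1) is used as the illustrative learner, with $\lambda$ as the complexity parameter; the data are simulated with the same sizes as the prostate training set but are not the prostate data, and all function names are assumptions of this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Illustrative learner: ridge regression with an unpenalized intercept."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc = X - xbar
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ybar))
    return ybar - xbar @ beta, beta

def cv_curve(X, y, lambdas, n_folds=10, seed=0):
    """Mean CV error and its standard error for each value in lambdas."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    err = np.zeros((n_folds, len(lambdas)))
    for f, test in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), test)
        for i, lam in enumerate(lambdas):
            b0, b = ridge_fit(X[train], y[train], lam)
            err[f, i] = np.mean((y[test] - b0 - X[test] @ b) ** 2)
    return err.mean(axis=0), err.std(axis=0, ddof=1) / np.sqrt(n_folds)

def one_se_choice(mean, se):
    """Indices of the CV minimum and of the one-standard-error choice.

    Assumes the grid runs from most to least complex, so the most
    parsimonious admissible model has the largest index.
    """
    i_min = int(np.argmin(mean))
    ok = np.nonzero(mean <= mean[i_min] + se[i_min])[0]
    return i_min, int(ok.max())

# Simulated training data (illustrative only).
rng = np.random.default_rng(3)
N, p = 67, 8
X = rng.normal(size=(N, p))
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=N)

# Increasing lambda means decreasing model complexity.
lambdas = np.array([0.0, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0])
cv_mean, cv_se = cv_curve(X, y, lambdas)
i_min, i_1se = one_se_choice(cv_mean, cv_se)
```

Only the training rows enter `cv_curve`; a held-out test set would be used afterwards to assess the model selected at `lambdas[i_1se]`.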

Best-subset selection chose to use the two predictors lcavol and lweight.
The last two lines of the table give the average prediction error (and its 
estimated standard error) over the test set. 


3.4 Shrinkage Methods 

By retaining a subset of the predictors and discarding the rest, subset
selection produces a model that is interpretable and has possibly lower
prediction error than the full model. However, because it is a discrete process
(variables are either retained or discarded), it often exhibits high variance,
and so doesn’t reduce the prediction error of the full model. Shrinkage
methods are more continuous, and don’t suffer as much from high
variability.


FIGURE 3.7. Estimated prediction error curves and their standard errors for
the various selection and shrinkage methods. Each curve is plotted as a function
of the corresponding complexity parameter for that method. The horizontal axis
has been chosen so that the model complexity increases as we move from left to
right. The estimates of prediction error and their standard errors were obtained by
tenfold cross-validation; full details are given in Section 7.10. The least complex
model within one standard error of the best is chosen, indicated by the purple
vertical broken lines.

TABLE 3.3. Estimated coefficients and test error results, for different subset
and shrinkage methods applied to the prostate data. The blank entries correspond
to variables omitted.

Term         LS       Best Subset   Ridge    Lasso    PCR      PLS
Intercept    2.465    2.477         2.452    2.468    2.497    2.452
lcavol       0.680    0.740         0.420    0.533    0.543    0.419
lweight      0.263    0.316         0.238    0.169    0.289    0.344
age         -0.141                 -0.046            -0.152   -0.026
lbph         0.210                  0.162    0.002    0.214    0.220
svi          0.305                  0.227    0.094    0.315    0.243
lcp         -0.288                  0.000            -0.051    0.079
gleason     -0.021                  0.040             0.232    0.011
pgg45        0.267                  0.133            -0.056    0.084
Test Error   0.521    0.492         0.492    0.479    0.449    0.528
Std Error    0.179    0.143         0.165    0.164    0.105    0.152


3.4.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty
on their size. The ridge coefficients minimize a penalized residual sum of
squares,

$$\hat\beta^{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \Big\{ \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \Big\}. \tag{3.41}$$

Here $\lambda \ge 0$ is a complexity parameter that controls the amount of
shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. The
coefficients are shrunk toward zero (and each other). The idea of
penalizing by the sum-of-squares of the parameters is also used in neural networks,
where it is known as weight decay (Chapter 11).

An equivalent way to write the ridge problem is

$$\hat\beta^{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2, \quad \text{subject to } \sum_{j=1}^p \beta_j^2 \le t, \tag{3.42}$$

which makes explicit the size constraint on the parameters. There is a one-
to-one correspondence between the parameters $\lambda$ in (3.41) and $t$ in (3.42).
When there are many correlated variables in a linear regression model,
their coefficients can become poorly determined and exhibit high variance.
A wildly large positive coefficient on one variable can be canceled by a
similarly large negative coefficient on its correlated cousin. By imposing a
size constraint on the coefficients, as in (3.42), this problem is alleviated.

The ridge solutions are not equivariant under scaling of the inputs, and
so one normally standardizes the inputs before solving (3.41). In addition,
notice that the intercept $\beta_0$ has been left out of the penalty term.
Penalization of the intercept would make the procedure depend on the origin
chosen for $Y$; that is, adding a constant $c$ to each of the targets $y_i$ would
not simply result in a shift of the predictions by the same amount $c$. It
can be shown (Exercise 3.5) that the solution to (3.41) can be separated
into two parts, after reparametrization using centered inputs: each $x_{ij}$ gets
replaced by $x_{ij} - \bar{x}_j$. We estimate $\beta_0$ by $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$. The remaining
coefficients get estimated by a ridge regression without intercept, using the
centered $x_{ij}$. Henceforth we assume that this centering has been done, so
that the input matrix $\mathbf{X}$ has $p$ (rather than $p + 1$) columns.

Writing the criterion in (3.41) in matrix form,

$$\operatorname{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda\beta^T\beta, \tag{3.43}$$

the ridge regression solutions are easily seen to be

$$\hat\beta^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}, \tag{3.44}$$

where $\mathbf{I}$ is the $p \times p$ identity matrix. Notice that with the choice of quadratic
penalty $\beta^T\beta$, the ridge regression solution is again a linear function of
$\mathbf{y}$. The solution adds a positive constant to the diagonal of $\mathbf{X}^T\mathbf{X}$ before
inversion. This makes the problem nonsingular, even if $\mathbf{X}^T\mathbf{X}$ is not of full
rank, and was the main motivation for ridge regression when it was first
introduced in statistics (Hoerl and Kennard, 1970). Traditional descriptions
of ridge regression start with definition (3.44). We choose to motivate it via
(3.41) and (3.42), as these provide insight into how it works.
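A short numerical sketch of (3.44) on simulated data follows. The function name `ridge`, the data, and the duplicated column used to force a singular $\mathbf{X}^T\mathbf{X}$ are assumptions of this example; the inputs are centered inside the function, so the intercept is estimated by $\bar{y}$ as described above.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (3.44) on centered inputs; returns (y-bar, beta)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y.mean(), beta

# Simulated data (illustrative only).
rng = np.random.default_rng(6)
N, p = 80, 4
X = rng.normal(size=(N, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=N)

lam = 10.0
b0, beta = ridge(X, y, lam)

# Stationarity of the criterion (3.43): -2 X^T (y - X beta) + 2 lam beta = 0.
Xc, yc = X - X.mean(axis=0), y - y.mean()
grad = -2.0 * Xc.T @ (yc - Xc @ beta) + 2.0 * lam * beta

# Coefficient norms along an increasing grid of lambda values.
norms = [np.linalg.norm(ridge(X, y, l)[1]) for l in (0.0, 1.0, 10.0, 100.0)]

# A duplicated column makes X^T X singular, yet the ridge system is solvable.
X_dup = np.column_stack([X, X[:, 0]])
Xc_dup = X_dup - X_dup.mean(axis=0)
rank_dup = np.linalg.matrix_rank(Xc_dup.T @ Xc_dup)
_, beta_dup = ridge(X_dup, y, 1.0)
```

In the duplicated-column case the two copies receive identical coefficients, a small instance of ridge shrinking the coefficients of correlated inputs toward each other.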

Figure 3.8 shows the ridge coefficient estimates for the prostate
cancer example, plotted as functions of $\operatorname{df}(\lambda)$, the effective degrees of freedom
implied by the penalty $\lambda$ (defined in (3.50) on page 68). In the case of
orthonormal inputs, the ridge estimates are just a scaled version of the least
squares estimates, that is, $\hat\beta^{\text{ridge}} = \hat\beta/(1 + \lambda)$.

Ridge regression can also be derived as the mean or mode of a
posterior distribution, with a suitably chosen prior distribution. In detail,
suppose $y_i \sim N(\beta_0 + x_i^T\beta, \sigma^2)$, and the parameters $\beta_j$ are each distributed as
$N(0, \tau^2)$, independently of one another. Then the (negative) log-posterior
density of $\beta$, with $\tau^2$ and $\sigma^2$ assumed known, is equal to the expression
in curly braces in (3.41), with $\lambda = \sigma^2/\tau^2$ (Exercise 3.6). Thus the ridge
estimate is the mode of the posterior distribution; since the distribution is
Gaussian, it is also the posterior mean.

The singular value decomposition (SVD) of the centered input matrix $\mathbf{X}$
gives us some additional insight into the nature of ridge regression. This
decomposition is extremely useful in the analysis of many statistical methods.
The SVD of the $N \times p$ matrix $\mathbf{X}$ has the form

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T. \tag{3.45}$$

Here $\mathbf{U}$ and $\mathbf{V}$ are $N \times p$ and $p \times p$ orthogonal matrices, with the columns
of $\mathbf{U}$ spanning the column space of $\mathbf{X}$, and the columns of $\mathbf{V}$ spanning the
row space. $\mathbf{D}$ is a $p \times p$ diagonal matrix, with diagonal entries
$d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ called the singular values of $\mathbf{X}$. If one or more values $d_j = 0$,
$\mathbf{X}$ is singular.

FIGURE 3.8. Profiles of ridge coefficients for the prostate cancer example, as
the tuning parameter $\lambda$ is varied. Coefficients are plotted versus $\operatorname{df}(\lambda)$, the effective
degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by
cross-validation.

Using the singular value decomposition we can write the least squares
fitted vector as

$$\mathbf{X}\hat\beta^{\text{ls}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{U}\mathbf{U}^T\mathbf{y}, \tag{3.46}$$

after some simplification. Note that $\mathbf{U}^T\mathbf{y}$ are the coordinates of $\mathbf{y}$ with
respect to the orthonormal basis $\mathbf{U}$. Note also the similarity with (3.33);
$\mathbf{Q}$ and $\mathbf{U}$ are generally different orthogonal bases for the column space of
$\mathbf{X}$ (Exercise 3.8).

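Now the ridge solutions are

$$\mathbf{X}\hat\beta^{\text{ridge}} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T\mathbf{y}, \tag{3.47}$$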


where the $\mathbf{u}_j$ are the columns of $\mathbf{U}$. Note that since $\lambda \ge 0$, we have
$d_j^2/(d_j^2 + \lambda) \le 1$. Like linear regression, ridge regression computes the coordinates of
$\mathbf{y}$ with respect to the orthonormal basis $\mathbf{U}$. It then shrinks these coordinates
by the factors $d_j^2/(d_j^2 + \lambda)$. This means that a greater amount of shrinkage
is applied to the coordinates of basis vectors with smaller $d_j^2$.
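The identity (3.47) is easy to confirm numerically. The sketch below, on simulated centered data with variable names chosen for this example, computes the ridge fit once from (3.44) and once from the shrunken coordinates in the basis $\mathbf{U}$.

```python
import numpy as np

# Simulated centered data (illustrative only).
rng = np.random.default_rng(7)
N, p = 60, 4
X = rng.normal(size=(N, p)) @ np.diag([3.0, 1.0, 0.5, 0.1])  # unequal column scales
Xc = X - X.mean(axis=0)
y = Xc @ np.array([1.0, -1.0, 2.0, 0.0]) + rng.normal(size=N)
yc = y - y.mean()
lam = 5.0

# Thin SVD: U is N x p, d holds the singular values in decreasing order.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

shrink = d ** 2 / (d ** 2 + lam)          # shrinkage factors d_j^2 / (d_j^2 + lam)
coords = U.T @ yc                         # coordinates of y in the basis U
fit_svd = U @ (shrink * coords)           # right-hand side of (3.47)

# Direct computation from the closed form (3.44).
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
fit_direct = Xc @ beta_ridge

fit_ls = U @ coords                       # least squares fit (3.46), no shrinkage
```

Since `d` is sorted in decreasing order, `shrink` is decreasing as well: the directions with the smallest singular values are shrunk the most.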

What does a small value of $d_j^2$ mean? The SVD of the centered matrix
$\mathbf{X}$ is another way of expressing the principal components of the variables
in $\mathbf{X}$. The sample covariance matrix is given by $\mathbf{S} = \mathbf{X}^T\mathbf{X}/N$, and from
(3.45) we have

$$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T, \tag{3.48}$$

which is the eigen decomposition of $\mathbf{X}^T\mathbf{X}$ (and of $\mathbf{S}$, up to a factor $N$).
The eigenvectors $v_j$ (columns of $\mathbf{V}$) are also called the principal
components (or Karhunen-Loeve) directions of $\mathbf{X}$. The first principal component
direction $v_1$ has the property that $\mathbf{z}_1 = \mathbf{X}v_1$ has the largest sample
variance amongst all normalized linear combinations of the columns of $\mathbf{X}$. This
sample variance is easily seen to be

$$\operatorname{Var}(\mathbf{z}_1) = \operatorname{Var}(\mathbf{X}v_1) = \frac{d_1^2}{N}, \tag{3.49}$$

and in fact $\mathbf{z}_1 = \mathbf{X}v_1 = \mathbf{u}_1 d_1$. The derived variable $\mathbf{z}_1$ is called the first
principal component of $\mathbf{X}$, and hence $\mathbf{u}_1$ is the normalized first principal
component. Subsequent principal components $\mathbf{z}_j$ have maximum variance
$d_j^2/N$, subject to being orthogonal to the earlier ones. Conversely the last
principal component has minimum variance. Hence the small singular
values $d_j$ correspond to directions in the column space of $\mathbf{X}$ having small
variance, and ridge regression shrinks these directions the most.

FIGURE 3.9. Principal components of some input data points. The largest
principal component is the direction that maximizes the variance of the projected data,
and the smallest principal component minimizes that variance. Ridge regression
projects $\mathbf{y}$ onto these components, and then shrinks the coefficients of the low-
variance components more than the high-variance components.
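The relations (3.48) and (3.49) can be verified on a small simulated cloud of two-dimensional points, loosely in the spirit of Figure 3.9. The data and variable names are assumptions of this example.

```python
import numpy as np

# Simulated two-dimensional inputs with one long and one short direction.
rng = np.random.default_rng(5)
N = 200
raw = rng.normal(size=(N, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
Xc = raw - raw.mean(axis=0)                 # centered input matrix

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                                    # columns are the directions v_j

Z = Xc @ V                                  # principal components z_j = X v_j
pc_var = Z.var(axis=0)                      # sample variances (divisor N)

# Eigenvalues of X^T X, for comparison with the squared singular values (3.48).
eigvals = np.linalg.eigvalsh(Xc.T @ Xc)
```

The first entry of `pc_var` is the largest attainable variance of any normalized linear combination of the columns, and the last entry is the smallest.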

Figure 3.9 illustrates the principal components of some data points in 
two dimensions. If we consider fitting a linear surface over this domain 
(the Y-axis is sticking out of the page), the configuration of the data allow
us to determine its gradient more accurately in the long direction than 
the short. Ridge regression protects against the potentially high variance 
of gradients estimated in the short directions. The implicit assumption is 
that the response will tend to vary most in the directions of high variance 
of the inputs. This is often a reasonable assumption, since predictors are 
often chosen for study because they vary with the response variable, but 
need not hold in general. 
In Figure 3.7 we have plotted the estimated prediction error versus the
quantity

$$\operatorname{df}(\lambda) = \operatorname{tr}[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T] = \operatorname{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}. \tag{3.50}$$

This monotone decreasing function of $\lambda$ is the effective degrees of freedom
of the ridge regression fit. Usually in a linear-regression fit with $p$ variables,
the degrees-of-freedom of the fit is $p$, the number of free parameters. The
idea is that although all $p$ coefficients in a ridge fit will be non-zero, they
are fit in a restricted fashion controlled by $\lambda$. Note that $\operatorname{df}(\lambda) = p$ when
$\lambda = 0$ (no regularization) and $\operatorname{df}(\lambda) \to 0$ as $\lambda \to \infty$. Of course there
is always an additional one degree of freedom for the intercept, which was
removed a priori. This definition is motivated in more detail in Section 3.4.4
and Sections 7.4-7.6. In Figure 3.7 the minimum occurs at $\operatorname{df}(\lambda) = 5.0$.
Table 3.3 shows that ridge regression reduces the test error of the full least
squares estimates by a small amount.
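A sketch of (3.50), together with a simple bisection that inverts the monotone map $\lambda \mapsto \operatorname{df}(\lambda)$ to find the penalty matching a target such as df = 5.0. The function names, the bisection bounds and the simulated data are assumptions of this example, not part of the original analysis.

```python
import numpy as np

def ridge_df(d2, lam):
    """Effective degrees of freedom (3.50) from squared singular values d2."""
    return float(np.sum(d2 / (d2 + lam)))

def lambda_for_df(d2, target, hi=1e8, n_iter=200):
    """Bisection for the lambda with ridge_df(d2, lambda) = target, 0 < target < p."""
    lo = 0.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if ridge_df(d2, mid) > target:   # df is decreasing in lambda
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Simulated centered inputs (illustrative only).
rng = np.random.default_rng(8)
N, p = 67, 8
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)
d2 = np.linalg.svd(Xc, compute_uv=False) ** 2

lam5 = lambda_for_df(d2, 5.0)

# Trace of the ridge hat matrix, for comparison with the SVD formula.
H = Xc @ np.linalg.solve(Xc.T @ Xc + lam5 * np.eye(p), Xc.T)
df_trace = float(np.trace(H))
```

Working from the singular values means that the decomposition is computed once and every value of $\lambda$ afterwards costs only $O(p)$ operations.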

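3.4.2 The Lasso

The lasso is a shrinkage method like ridge, with subtle but important
differences. The lasso estimate is defined by

$$\hat\beta^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 \quad \text{subject to } \sum_{j=1}^p |\beta_j| \le t. \tag{3.51}$$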

Just as in ridge regression, we can re-parametrize the constant $\beta_0$ by
standardizing the predictors; the solution for $\hat\beta_0$ is $\bar{y}$, and thereafter we fit a
model without an intercept (Exercise 3.5). In the signal processing
literature, the lasso is also known as basis pursuit (Chen et al., 1998).

We can also write the lasso problem in the equivalent Lagrangian form

$$\hat\beta^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \Big\{ \frac{1}{2}\sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^p |\beta_j| \Big\}. \tag{3.52}$$

Notice the similarity to the ridge regression problem (3.42) or (3.41): the
$L_2$ ridge penalty $\sum_1^p \beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_1^p |\beta_j|$. This
latter constraint makes the solutions nonlinear in the $y_i$, and there is no
closed form expression as in ridge regression. Computing the lasso solution
is a quadratic programming problem, although we see in Section 3.4.4 that
efficient algorithms are available for computing the entire path of solutions
as $\lambda$ is varied, with the same computational cost as for ridge regression.
Because of the nature of the constraint, making $t$ sufficiently small will
cause some of the coefficients to be exactly zero. Thus the lasso does a kind
of continuous subset selection. If $t$ is chosen larger than $t_0 = \sum_1^p |\hat\beta_j|$ (where
$\hat\beta_j = \hat\beta_j^{\text{ls}}$, the least squares estimates), then the lasso estimates are the $\hat\beta_j$'s.
On the other hand, for $t = t_0/2$ say, then the least squares coefficients are
shrunk by about 50% on average. However, the nature of the shrinkage
is not obvious, and we investigate it further in Section 3.4.4 below. Like
the subset size in variable subset selection, or the penalty parameter in
ridge regression, $t$ should be adaptively chosen to minimize an estimate of
expected prediction error.
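To make the Lagrangian form (3.52) concrete, here is a small solver based on cyclic coordinate descent. This is a swapped-in technique chosen for brevity; it is neither the quadratic programming formulation nor the path algorithm of Section 3.4.4. The function names, iteration limits and simulated data are assumptions of this example.

```python
import numpy as np

def soft_threshold(z, g):
    """sign(z) * (|z| - g)_+ for scalars."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_sweeps=1000, tol=1e-10):
    """Minimize 0.5 * RSS + lam * sum|beta_j| (3.52) by cyclic coordinate descent.

    Returns (intercept on the original input scale, beta).
    """
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, yc = X - xbar, y - ybar
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = np.sum(Xc ** 2, axis=0)
    r = yc.copy()                                   # current residual
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # inner product of x_j with the residual that excludes x_j's own term
            rho = Xc[:, j] @ r + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            r += Xc[:, j] * (beta[j] - new)         # keep the residual in sync
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return ybar - xbar @ beta, beta

# Simulated data (illustrative only).
rng = np.random.default_rng(9)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = 0.5 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=N)

lam = 30.0
b0, beta = lasso_cd(X, y, lam)
t = float(np.sum(np.abs(beta)))                     # the matching bound t in (3.51)
```

At a solution, every nonzero coefficient has $|\mathbf{x}_j^T \mathbf{r}| = \lambda$ and every zero coefficient has $|\mathbf{x}_j^T \mathbf{r}| \le \lambda$, where $\mathbf{r}$ is the residual; this is a convenient check that the iteration has converged.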

In Figure 3.7, for ease of interpretation, we have plotted the lasso
prediction error estimates versus the standardized parameter $s = t/\sum_1^p |\hat\beta_j|$.
A value $\hat{s} \approx 0.36$ was chosen by 10-fold cross-validation; this caused four
coefficients to be set to zero (fifth column of Table 3.3). The resulting
model has the second lowest test error, slightly lower than the full least
squares model, but the standard errors of the test error estimates (last line
of Table 3.3) are fairly large.

Figure 3.10 shows the lasso coefficients as the standardized tuning
parameter $s = t/\sum_1^p |\hat\beta_j|$ is varied. At $s = 1.0$ these are the least squares
estimates; they decrease to 0 as $s \to 0$. This decrease is not always strictly
monotonic, although it is in this example. A vertical line is drawn at
$s = 0.36$, the value chosen by cross-validation.


3.4.3 Discussion: Subset Selection, Ridge Regression and the
Lasso

In this section we discuss and compare the three approaches discussed so far
for restricting the linear regression model: subset selection, ridge regression
and the lasso.

In the case of an orthonormal input matrix $\mathbf{X}$ the three procedures have
explicit solutions. Each method applies a simple transformation to the least
squares estimate $\hat\beta_j$, as detailed in Table 3.4.

Ridge regression does a proportional shrinkage. Lasso translates each
coefficient by a constant factor $\lambda$, truncating at zero. This is called “soft
thresholding,” and is used in the context of wavelet-based smoothing in
Section 5.9. Best-subset selection drops all variables with coefficients smaller
than the $M$th largest; this is a form of “hard-thresholding.”
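The three orthonormal-case transformations of Table 3.4 are one-liners. The sketch below applies them to a made-up vector of least squares estimates; the function names and the numbers are illustrative assumptions, and `best_subset_orth` assumes $M \ge 1$.

```python
import numpy as np

def best_subset_orth(beta_ls, M):
    """Hard thresholding: keep the M (>= 1) largest coefficients in absolute value."""
    cutoff = np.sort(np.abs(beta_ls))[::-1][M - 1]     # |beta_(M)|
    return beta_ls * (np.abs(beta_ls) >= cutoff)

def ridge_orth(beta_ls, lam):
    """Proportional shrinkage by 1 / (1 + lam)."""
    return beta_ls / (1.0 + lam)

def lasso_orth(beta_ls, lam):
    """Soft thresholding: translate toward zero by lam, truncating at zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical least squares estimates

b_subset = best_subset_orth(beta_ls, M=2)
b_ridge = ridge_orth(beta_ls, lam=1.0)
b_lasso = lasso_orth(beta_ls, lam=0.5)
```

With these inputs, hard thresholding leaves the two largest coefficients untouched, ridge halves every coefficient, and soft thresholding moves the two largest by 0.5 toward zero while setting the two smallest exactly to zero.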

Back to the nonorthogonal case; some pictures help understand their
relationship. Figure 3.11 depicts the lasso (left) and ridge regression (right)
when there are only two parameters. The residual sum of squares has
elliptical contours, centered at the full least squares estimate. The constraint
region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \le t$, while that for lasso is
the diamond $|\beta_1| + |\beta_2| \le t$. Both methods find the first point where the
elliptical contours hit the constraint region. Unlike the disk, the diamond
has corners; if the solution occurs at a corner, then it has one parameter
$\beta_j$ equal to zero. When $p > 2$, the diamond becomes a rhomboid, and has
many corners, flat edges and faces; there are many more opportunities for
the estimated parameters to be zero.

FIGURE 3.10. Profiles of lasso coefficients, as the tuning parameter $t$ is varied.
Coefficients are plotted versus $s = t/\sum_1^p |\hat\beta_j|$. A vertical line is drawn at $s = 0.36$,
the value chosen by cross-validation. Compare Figure 3.8 on page 65; the lasso
profiles hit zero, while those for ridge do not. The profiles are piece-wise linear,
and so are computed only at the points displayed; see Section 3.4.4 for details.

TABLE 3.4. Estimators of $\beta_j$ in the case of orthonormal columns of $\mathbf{X}$. $M$ and $\lambda$
are constants chosen by the corresponding techniques; sign denotes the sign of its
argument ($\pm 1$), and $x_+$ denotes “positive part” of $x$. Below the table, estimators
are shown by broken red lines. The 45° line in gray shows the unrestricted estimate
for reference.

Estimator               Formula
Best subset (size M)    $\hat\beta_j \cdot I(|\hat\beta_j| \ge |\hat\beta_{(M)}|)$
Ridge                   $\hat\beta_j / (1 + \lambda)$
Lasso                   $\operatorname{sign}(\hat\beta_j)(|\hat\beta_j| - \lambda)_+$

[Plot panels below the table: Best Subset, Ridge, Lasso]

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression
(right). Shown are contours of the error and constraint functions. The solid blue
areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively,
while the red ellipses are the contours of the least squares error function.

We can generalize ridge regression and the lasso, and view them as Bayes
estimates. Consider the criterion

$$\tilde\beta = \underset{\beta}{\operatorname{argmin}} \Big\{ \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^p |\beta_j|^q \Big\} \tag{3.53}$$

for $q \ge 0$. The contours of constant value of $\sum_j |\beta_j|^q$ are shown in
Figure 3.12, for the case of two inputs.

Thinking of $|\beta_j|^q$ as the log-prior density for $\beta_j$, these are also the
equi-contours of the prior distribution of the parameters. The value $q = 0$
corresponds to variable subset selection, as the penalty simply counts the number
of nonzero parameters; $q = 1$ corresponds to the lasso, while $q = 2$ to ridge
regression. Notice that for $q \le 1$, the prior is not uniform in direction, but
concentrates more mass in the coordinate directions. The prior
corresponding to the $q = 1$ case is an independent double exponential (or Laplace)
distribution for each input, with density $(1/2\tau)\exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$.
The case $q = 1$ (lasso) is the smallest $q$ such that the constraint region
is convex; non-convex constraint regions make the optimization problem
more difficult.

In this view, the lasso, ridge regression and best subset selection are 
Bayes estimates with different priors. Note, however, that they are derived 
as posterior modes, that is, maximizers of the posterior. It is more common 
to use the mean of the posterior as the Bayes estimate. Ridge regression is 
also the posterior mean, but the lasso and best subset selection are not. 

Looking again at the criterion (3.53), we might try using other values 
of q besides 0, 1, or 2. Although one might consider estimating q from 
the data, our experience is that it is not worth the effort for the extra 
variance incurred. Values of q G (1,2) suggest a compromise between the 
lasso and ridge regression. Although this is the case, with q > 1, \/3j\ q is 
differentiable at 0, and so does not share the ability of lasso (q = 1) for 


q = 4 q = 2 q = 1 q — 0.5 q = 0.1 



FIGURE 3.12. Contours of constant value of 5^. \Pj\ q f or given values of q. 
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FIGURE 3.13. Contours of constant value o/JA f or Q = 1-2 (left plot), 
and the elastic-net penalty ^2 j(a/3j +(l—a)\/3j\) fora = 0.2 (right plot). Although 
visually very similar, the elastic-net has sharp (non-differentiable) corners, while 
the q = 1.2 penalty does not. 

setting coefficients exactly to zero. Partly for this reason as well as for 
computational tractability, Zou and Hastie (2005) introduced the elastic- 
net penalty 

+ (!-«) I&l). (3-54) 

i=i 

a different compromise between ridge and lasso. Figure 3.13 compares the
$L_q$ penalty with q = 1.2 and the elastic-net penalty with $\alpha = 0.2$; it is
hard to detect the difference by eye. The elastic-net selects variables like
the lasso, and shrinks together the coefficients of correlated predictors like
ridge. It also has considerable computational advantages over the $L_q$
penalties. We discuss the elastic-net further in Section 18.4.
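
The difference between the two penalties at the origin can be checked numerically. The following is a small sketch, not part of the original text: the function names and the finite-difference step h are illustrative assumptions. It compares the one-sided slope at zero of the q = 1.2 penalty with that of the elastic-net penalty with $\alpha = 0.2$.

```python
def lq_penalty(beta, q):
    """Sum of |beta_j|^q over the coefficients."""
    return sum(abs(b) ** q for b in beta)


def elastic_net_penalty(beta, alpha):
    """Sum of alpha*beta_j^2 + (1 - alpha)*|beta_j|, i.e. (3.54) without lambda."""
    return sum(alpha * b * b + (1.0 - alpha) * abs(b) for b in beta)


def right_slope_at_zero(penalty, h=1e-30):
    """One-sided finite-difference slope of a single-coefficient penalty at 0."""
    return (penalty([h]) - penalty([0.0])) / h


slope_lq = right_slope_at_zero(lambda b: lq_penalty(b, q=1.2))
slope_en = right_slope_at_zero(lambda b: elastic_net_penalty(b, alpha=0.2))
print(slope_lq)  # tends to 0 as h shrinks: no kink at the origin
print(slope_en)  # stays near 1 - alpha: a kink at the origin
```

A nonzero one-sided slope at zero is what allows a penalty to hold coefficients exactly at zero, which is the property the q = 1.2 penalty lacks.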

3.4.4 Least Angle Regression

Least angle regression (LAR) is a relative newcomer (Efron et al., 2004), 
and can be viewed as a kind of “democratic” version of forward stepwise 
regression (Section 3.3.2). As we will see, LAR is intimately connected 
with the lasso, and in fact provides an extremely efficient algorithm for 
computing the entire lasso path as in Figure 3.10. 

Forward stepwise regression builds a model sequentially, adding one
variable at a time. At each step, it identifies the best variable to include in the
active set, and then updates the least squares fit to include all the active
variables.

Least angle regression uses a similar strategy, but only enters “as much” 
of a predictor as it deserves. At the first step it identifies the variable 
most correlated with the response. Rather than fit this variable completely, 
LAR moves the coefficient of this variable continuously toward its least- 
squares value (causing its correlation with the evolving residual to decrease 
in absolute value). As soon as another variable “catches up” in terms of 
correlation with the residual, the process is paused. The second variable 
then joins the active set, and their coefficients are moved together in a way 
that keeps their correlations tied and decreasing. This process is continued
until all the variables are in the model, and ends at the full least-squares
fit. Algorithm 3.2 provides the details. The termination condition in step 5
requires some explanation. If p > N − 1, the LAR algorithm reaches a zero
residual solution after N − 1 steps (the −1 is because we have centered the
data).


Algorithm 3.2 Least Angle Regression. 

1. Standardize the predictors to have mean zero and unit norm. Start
with the residual $r = y - \bar{y}$, $\beta_1, \beta_2, \ldots, \beta_p = 0$.

2. Find the predictor $x_j$ most correlated with r.

3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r\rangle$, until some
other competitor $x_k$ has as much correlation with the current residual
as does $x_j$.

4. Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares
coefficient of the current residual on $(x_j, x_k)$, until some other
competitor $x_l$ has as much correlation with the current residual.

5. Continue in this way until all p predictors have been entered. After
min(N − 1, p) steps, we arrive at the full least-squares solution.


Suppose $\mathcal{A}_k$ is the active set of variables at the beginning of the kth
step, and let $\beta_{\mathcal{A}_k}$ be the coefficient vector for these variables at this step;
there will be k − 1 nonzero values, and the one just entered will be zero. If
$r_k = y - X_{\mathcal{A}_k}\beta_{\mathcal{A}_k}$ is the current residual, then the direction for this step is

    $\delta_k = (X_{\mathcal{A}_k}^T X_{\mathcal{A}_k})^{-1} X_{\mathcal{A}_k}^T r_k.$    (3.55)

The coefficient profile then evolves as $\beta_{\mathcal{A}_k}(\alpha) = \beta_{\mathcal{A}_k} + \alpha \cdot \delta_k$. Exercise 3.23
verifies that the directions chosen in this fashion do what is claimed: keep
the correlations tied and decreasing. If the fit vector at the beginning of
this step is $\hat{f}_k$, then it evolves as $\hat{f}_k(\alpha) = \hat{f}_k + \alpha \cdot u_k$, where $u_k = X_{\mathcal{A}_k}\delta_k$
is the new fit direction. The name “least angle” arises from a geometrical
interpretation of this process; $u_k$ makes the smallest (and equal) angle
with each of the predictors in $\mathcal{A}_k$ (Exercise 3.24). Figure 3.14 shows the
absolute correlations decreasing and joining ranks with each step of the
LAR algorithm, using simulated data.
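
The steps of Algorithm 3.2 and the direction (3.55) can be sketched in NumPy as follows. This is an illustrative sketch rather than a reference implementation: the function name `lar_path`, the simulated data, and the numerical tolerances are assumptions, and the step length uses the closed-form tie condition, namely the smallest $\alpha \in (0, 1]$ at which an inactive predictor reaches the shrinking common correlation $(1-\alpha)C$.

```python
import numpy as np


def lar_path(X, y):
    """Least angle regression (Algorithm 3.2); returns coefficients after each step.

    Assumes the columns of X are centered with unit norm and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    r = y.copy()
    active = [int(np.argmax(np.abs(X.T @ r)))]
    for _ in range(min(n - 1, p)):
        XA = X[:, active]
        # Direction (3.55): least-squares coefficient of the residual on X_A.
        delta = np.linalg.solve(XA.T @ XA, XA.T @ r)
        u = XA @ delta                       # fit direction u_k
        c = X.T @ r                          # current inner products
        C = np.max(np.abs(c[active]))        # common absolute value on the active set
        a = X.T @ u
        # Smallest step in (0, 1) at which an inactive predictor ties.
        alpha, entering = 1.0, None
        for j in range(p):
            if j in active:
                continue
            for num, den in ((C - c[j], C - a[j]), (C + c[j], C + a[j])):
                if den > 1e-12 and 1e-12 < num / den < alpha:
                    alpha, entering = num / den, j
        beta[active] += alpha * delta
        r = r - alpha * u
        path.append(beta.copy())
        if entering is None:                 # no competitor left: full least squares
            break
        active.append(entering)
    return np.array(path)


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + 0.1 * rng.normal(size=40)
y = y - y.mean()
path = lar_path(X, y)
print(path.round(3))   # one row per step; the last row is the least squares fit
```

Because the active inner products all shrink by the factor $(1-\alpha)$ along the direction $\delta_k$, the step length is available exactly and no small-step search is required.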

By construction the coefficients in LAR change in a piecewise linear fashion.
Figure 3.15 [left panel] shows the LAR coefficient profile evolving as a
function of their $L_1$ arc length². Note that we do not need to take small


² The $L_1$ arc-length of a differentiable curve $\beta(s)$ for $s \in [0, S]$ is given by
$TV(\beta, S) = \int_0^S \|\dot\beta(s)\|_1 \, ds$, where $\dot\beta(s) = \partial\beta(s)/\partial s$. For the piecewise-linear LAR coefficient profile,
this amounts to summing the $L_1$ norms of the changes in coefficients from step to step.

FIGURE 3.14. Progression of the absolute correlations during each step of the
LAR procedure, using a simulated data set with six predictors. The labels at the
top of the plot indicate which variables enter the active set at each step. The step
lengths are measured in units of $L_1$ arc length.

FIGURE 3.15. Left panel (Least Angle Regression) shows the LAR coefficient profiles
on the simulated data, as a function of the $L_1$ arc length. The right panel (Lasso)
shows the lasso profile. They are identical until the dark-blue coefficient crosses
zero at an arc length of about 18.


steps and recheck the correlations in step 3; using knowledge of the covariance
of the predictors and the piecewise linearity of the algorithm, we can
work out the exact step length at the beginning of each step (Exercise 3.25). 

The right panel of Figure 3.15 shows the lasso coefficient profiles on the 
same data. They are almost identical to those in the left panel, and differ 
for the first time when the blue coefficient passes back through zero. For the 
prostate data, the LAR coefficient profile turns out to be identical to the 
lasso profile in Figure 3.10, which never crosses zero. These observations 
lead to a simple modification of the LAR algorithm that gives the entire 
lasso path, which is also piecewise-linear. 


Algorithm 3.2a Least Angle Regression: Lasso Modification. 


4a. If a non-zero coefficient hits zero, drop its variable from the active set 
of variables and recompute the current joint least squares direction. 


The LAR(lasso) algorithm is extremely efficient, requiring the same order 
of computation as that of a single least squares fit using the p predictors. 
Least angle regression always takes p steps to get to the full least squares 
estimates. The lasso path can have more than p steps, although the two 
are often quite similar. Algorithm 3.2 with the lasso modification 3.2a is 
an efficient way of computing the solution to any lasso problem, especially
when $p \gg N$. Osborne et al. (2000a) also discovered a piecewise-linear path
for computing the lasso, which they called a homotopy algorithm.

We now give a heuristic argument for why these procedures are so similar. 
Although the LAR algorithm is stated in terms of correlations, if the input 
features are standardized, it is equivalent and easier to work with inner- 
products. Suppose A is the active set of variables at some stage in the 
algorithm, tied in their absolute inner-product with the current residuals 
$y - X\beta$. We can express this as

    $x_j^T(y - X\beta) = \gamma \cdot s_j, \quad \forall j \in \mathcal{A}$    (3.56)

where $s_j \in \{-1, 1\}$ indicates the sign of the inner-product, and $\gamma$ is the
common value. Also $|x_k^T(y - X\beta)| \le \gamma \ \forall k \notin \mathcal{A}$. Now consider the lasso
criterion (3.52), which we write in vector form

    $R(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$    (3.57)

Let $\mathcal{B}$ be the active set of variables in the solution for a given value of $\lambda$.
For these variables $R(\beta)$ is differentiable, and the stationarity conditions
give

    $x_j^T(y - X\beta) = \lambda \cdot \mathrm{sign}(\beta_j), \quad \forall j \in \mathcal{B}$    (3.58)

Comparing (3.58) with (3.56), we see that they are identical only if the
sign of $\beta_j$ matches the sign of the inner product. That is why the LAR
algorithm and lasso start to differ when an active coefficient passes through
zero; condition (3.58) is violated for that variable, and it is kicked out of the
active set $\mathcal{B}$. Exercise 3.23 shows that these equations imply a
piecewise-linear coefficient profile as $\lambda$ decreases. The stationarity conditions for the
non-active variables require that

    $|x_k^T(y - X\beta)| \le \lambda, \quad \forall k \notin \mathcal{B},$    (3.59)


which again agrees with the LAR algorithm. 
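
Conditions (3.58) and (3.59) can be checked numerically on a fitted lasso solution. The sketch below is an assumption-laden illustration: it solves (3.57) by cyclic coordinate descent, a different technique from the LAR path algorithm discussed above, and the names `lasso_cd` and `soft_threshold`, the simulated data, and the value of `lam` are all made up for the example.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    r = y.copy()                                 # residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of x_j with the partial residual (x_j's effect restored).
            z = X[:, j] @ r + col_sq[j] * beta[j]
            new = soft_threshold(z, lam) / col_sq[j]
            r += X[:, j] * (beta[j] - new)       # keep r equal to y - X beta
            beta[j] = new
    return beta


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 6))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.0]) + 0.2 * rng.normal(size=60)
y = y - y.mean()

lam = 0.8
beta = lasso_cd(X, y, lam)
g = X.T @ (y - X @ beta)          # inner products with the residual
active = beta != 0
print(g[active], lam * np.sign(beta[active]))    # compare the two sides of (3.58)
print(np.abs(g[~active]).max(initial=0.0), lam)  # left side of (3.59) against lam
```

For the active variables the inner products sit at $\pm\lambda$ with the sign of the coefficient, and for the others they stay inside the band, mirroring the tied-correlation picture of LAR.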

Figure 3.16 compares LAR and lasso to forward stepwise and stagewise 
regression. The setup is the same as in Figure 3.6 on page 59, except that
N = 100 here rather than 300, so the problem is more difficult. We see
that the more aggressive forward stepwise starts to overfit quite early (well
before the 10 true variables can enter the model), and ultimately performs
worse than the slower forward stagewise regression. The behavior of LAR
and lasso is similar to that of forward stagewise regression. Incremental
forward stagewise is similar to LAR and lasso, and is described in
Section 3.8.1.

Degrees-of-Freedom Formula for LAR and Lasso 

Suppose that we fit a linear model via the least angle regression procedure, 
stopping at some number of steps k < p, or equivalently using a lasso bound 
t that produces a constrained version of the full least squares fit. How many 
parameters, or “degrees of freedom” have we used? 

Consider first a linear regression using a subset of k features. If this subset 
is prespecified in advance without reference to the training data, then the 
degrees of freedom used in the fitted model is defined to be k. Indeed, in 
classical statistics, the number of linearly independent parameters is what 
is meant by “degrees of freedom.” Alternatively, suppose that we carry out 
a best subset selection to determine the “optimal” set of k predictors. Then 
the resulting model has k parameters, but in some sense we have used up 
more than k degrees of freedom. 

We need a more general definition for the effective degrees of freedom of 
an adaptively fitted model. We define the degrees of freedom of the fitted 
vector $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N)$ as

    $\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i).$    (3.60)

Here $\mathrm{Cov}(\hat{y}_i, y_i)$ refers to the sampling covariance between the predicted
value $\hat{y}_i$ and its corresponding outcome value $y_i$. This makes intuitive sense:
the harder that we fit to the data, the larger this covariance and hence
$\mathrm{df}(\hat{y})$. Expression (3.60) is a useful notion of degrees of freedom, one that
can be applied to any model prediction $\hat{y}$. This includes models that are

FIGURE 3.16. Comparison of LAR and lasso with forward stepwise, forward
stagewise (FS) and incremental forward stagewise ($FS_0$) regression. The setup
is the same as in Figure 3.6, except N = 100 here rather than 300. Here the
slower FS regression ultimately outperforms forward stepwise. LAR and lasso
show similar behavior to FS and $FS_0$. Since the procedures take different numbers
of steps (across simulation replicates and methods), we plot the MSE as a function
of the fraction of total $L_1$ arc-length toward the least-squares fit.


adaptively fitted to the training data. This definition is motivated and 
discussed further in Sections 7.4-7.6. 

Now for a linear regression with k fixed predictors, it is easy to show
that $\mathrm{df}(\hat{y}) = k$. Likewise for ridge regression, this definition leads to the
closed-form expression (3.50) on page 68: $\mathrm{df}(\hat{y}) = \mathrm{tr}(S_\lambda)$. In both these
cases, (3.60) is simple to evaluate because the fit $\hat{y} = H_\lambda y$ is linear in y.
If we think about definition (3.60) in the context of a best subset selection
of size k, it seems clear that $\mathrm{df}(\hat{y})$ will be larger than k, and this can be
verified by estimating $\mathrm{Cov}(\hat{y}_i, y_i)/\sigma^2$ directly by simulation. However there
is no closed form method for estimating $\mathrm{df}(\hat{y})$ for best subset selection.
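
The simulation idea can be illustrated on a case where the answer is known. The sketch below estimates (3.60) by Monte Carlo for a ridge fit and compares it with the trace formula; the design, the mean vector, the noise level, and the number of replicates are all assumptions chosen for illustration, and the ridge fit here has no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam, sigma = 30, 5, 2.0, 1.0
X = rng.normal(size=(N, p))
mu = X @ rng.normal(size=p)                  # fixed mean vector (assumed true model)

# Ridge smoother matrix S_lambda = X (X^T X + lam I)^{-1} X^T.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

B = 5000                                     # number of simulated response vectors
Y = mu + sigma * rng.normal(size=(B, N))     # each row is one draw of y
Yhat = Y @ S.T                               # fitted vectors, one per row

# Sample covariance between yhat_i and y_i across replicates, summed over i.
cov = np.mean((Yhat - Yhat.mean(axis=0)) * (Y - Y.mean(axis=0)), axis=0)
df_mc = cov.sum() / sigma ** 2               # Monte Carlo version of (3.60)
df_trace = np.trace(S)                       # closed form tr(S_lambda)
print(df_mc, df_trace)
```

The same loop applies to an adaptive procedure such as best subset selection by replacing the linear map with a refit on every simulated response, which is how the covariance would be estimated when no closed form exists.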

For LAR and lasso, something magical happens. These techniques are 
adaptive in a smoother way than best subset selection, and hence estimation 
of degrees of freedom is more tractable. Specifically it can be shown that 
after the kth step of the LAR procedure, the effective degrees of freedom of
the fit vector is exactly k. Now for the lasso, the (modified) LAR procedure
often takes more than p steps, since predictors can drop out. Hence the
definition is a little different; for the lasso, at any stage $\mathrm{df}(\hat{y})$ approximately
equals the number of predictors in the model. While this approximation 
works reasonably well anywhere in the lasso path, for each k it works best 
at the last model in the sequence that contains k predictors. A detailed 
study of the degrees of freedom for the lasso may be found in Zou et al. 
(2007). 

3.5 Methods Using Derived Input Directions 

In many situations we have a large number of inputs, often very correlated. 
The methods in this section produce a small number of linear combinations 
$Z_m$, m = 1, ..., M of the original inputs $X_j$, and the $Z_m$ are then used in
place of the $X_j$ as inputs in the regression. The methods differ in how the
linear combinations are constructed. 

3.5.1 Principal Components Regression 

In this approach the linear combinations $Z_m$ used are the principal
components as defined in Section 3.4.1 above.

Principal component regression forms the derived input columns $z_m = X v_m$,
and then regresses y on $z_1, z_2, \ldots, z_M$ for some $M \le p$. Since the $z_m$
are orthogonal, this regression is just a sum of univariate regressions:

    $\hat{y}^{pcr}_{(M)} = \bar{y}\mathbf{1} + \sum_{m=1}^{M} \hat\theta_m z_m,$    (3.61)

where $\hat\theta_m = \langle z_m, y\rangle / \langle z_m, z_m\rangle$. Since the $z_m$ are each linear combinations
of the original $x_j$, we can express the solution (3.61) in terms of coefficients
of the $x_j$ (Exercise 3.13):

    $\hat\beta^{pcr}(M) = \sum_{m=1}^{M} \hat\theta_m v_m.$    (3.62)

As with ridge regression, principal components depend on the scaling of
the inputs, so typically we first standardize them. Note that if M = p, we
would just get back the usual least squares estimates, since the columns of
Z = UD span the column space of X. For M < p we get a reduced
regression. We see that principal components regression is very similar to ridge
regression: both operate via the principal components of the input
matrix. Ridge regression shrinks the coefficients of the principal components
(Figure 3.17), shrinking more depending on the size of the corresponding
eigenvalue; principal components regression discards the p − M smallest
eigenvalue components. Figure 3.17 illustrates this.

FIGURE 3.17. Ridge regression shrinks the regression coefficients of the
principal components, using shrinkage factors $d_j^2/(d_j^2 + \lambda)$ as in (3.47). Principal
component regression truncates them. Shown are the shrinkage and truncation
patterns corresponding to Figure 3.7, as a function of the principal component
index.


In Figure 3.7 we see that cross-validation suggests seven terms; the
resulting model has the lowest test error in Table 3.3.
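
Equations (3.61) and (3.62) translate directly into a few lines of NumPy. The following is a minimal sketch under stated assumptions: `pcr_coefficients` is a hypothetical helper name, the data are simulated, and both X and y are assumed to be centered so that the intercept term can be dropped.

```python
import numpy as np


def pcr_coefficients(X, y, M):
    """PCR coefficients (3.62) on centered X and y using the first M components."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T[:, :M]                                  # principal directions v_1..v_M
    Z = X @ V                                        # derived inputs z_m = X v_m
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)       # <z_m, y> / <z_m, z_m>
    return V @ theta


# Simulated, standardized data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)
y = y - y.mean()

beta_two = pcr_coefficients(X, y, M=2)               # reduced regression
beta_full = pcr_coefficients(X, y, M=4)              # M = p
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_two, beta_full, beta_ols)
```

With M = p the coefficients coincide with least squares, as stated above; smaller M discards the low-variance directions entirely.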


3.5.2 Partial Least Squares 

This technique also constructs a set of linear combinations of the inputs
for regression, but unlike principal components regression it uses y (in
addition to X) for this construction. Like principal component regression,
partial least squares (PLS) is not scale invariant, so we assume that each
$x_j$ is standardized to have mean 0 and variance 1. PLS begins by
computing $\hat\varphi_{1j} = \langle x_j, y\rangle$ for each j. From this we construct the derived input
$z_1 = \sum_j \hat\varphi_{1j} x_j$, which is the first partial least squares direction. Hence
in the construction of each $z_m$, the inputs are weighted by the strength
of their univariate effect on y³. The outcome y is regressed on $z_1$ giving
coefficient $\hat\theta_1$, and then we orthogonalize $x_1, \ldots, x_p$ with respect to $z_1$. We
continue this process, until $M \le p$ directions have been obtained. In this
manner, partial least squares produces a sequence of derived, orthogonal
inputs or directions $z_1, z_2, \ldots, z_M$. As with principal-component
regression, if we were to construct all M = p directions, we would get back a
solution equivalent to the usual least squares estimates; using M < p
directions produces a reduced regression. The procedure is described fully in
Algorithm 3.3.


³ Since the $x_j$ are standardized, the first directions $\hat\varphi_{1j}$ are the univariate regression
coefficients (up to an irrelevant constant); this is not the case for subsequent directions.

Algorithm 3.3 Partial Least Squares.

1. Standardize each $x_j$ to have mean zero and variance one. Set $\hat{y}^{(0)} = \bar{y}\mathbf{1}$,
and $x_j^{(0)} = x_j$, j = 1, ..., p.

2. For m = 1, 2, ..., p

(a) $z_m = \sum_{j=1}^{p} \hat\varphi_{mj} x_j^{(m-1)}$, where $\hat\varphi_{mj} = \langle x_j^{(m-1)}, y\rangle$.

(b) $\hat\theta_m = \langle z_m, y\rangle / \langle z_m, z_m\rangle$.

(c) $\hat{y}^{(m)} = \hat{y}^{(m-1)} + \hat\theta_m z_m$.

(d) Orthogonalize each $x_j^{(m-1)}$ with respect to $z_m$:
$x_j^{(m)} = x_j^{(m-1)} - [\langle z_m, x_j^{(m-1)}\rangle / \langle z_m, z_m\rangle] z_m$, j = 1, 2, ..., p.

3. Output the sequence of fitted vectors $\{\hat{y}^{(m)}\}_1^p$. Since the $\{z_\ell\}_1^m$ are
linear in the original $x_j$, so is $\hat{y}^{(m)} = X\hat\beta^{pls}(m)$. These linear
coefficients can be recovered from the sequence of PLS transformations.
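
Algorithm 3.3 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: `pls_fits` is a hypothetical name, the data are simulated, and the predictors are assumed to be standardized before the call.

```python
import numpy as np


def pls_fits(X, y, M):
    """Algorithm 3.3: return fitted vectors yhat^(1..M) and directions z_1..z_M.

    Assumes the columns of X are already standardized.
    """
    Xm = X.copy()
    yhat = np.full(len(y), y.mean())              # yhat^(0) = ybar * 1
    fits, Z = [], []
    for _ in range(M):
        phi = Xm.T @ y                            # step (a): phi_mj = <x_j^(m-1), y>
        z = Xm @ phi
        theta = (z @ y) / (z @ z)                 # step (b)
        yhat = yhat + theta * z                   # step (c)
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)   # step (d): orthogonalize on z_m
        fits.append(yhat.copy())
        Z.append(z)
    return np.array(fits), np.array(Z).T


# Simulated, standardized data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 + X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)

fits, Z = pls_fits(X, y, M=4)
rss = [float(np.sum((y - f) ** 2)) for f in fits]
print(rss)   # residual sum of squares after each PLS direction
```

Using all M = p directions reproduces the least squares fit, and the derived inputs come out mutually orthogonal, which is why step (b) can use simple univariate regressions.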


In the prostate cancer example, cross-validation chose M = 2 PLS directions
in Figure 3.7. This produced the model given in the rightmost column
of Table 3.3. 

What optimization problem is partial least squares solving? Since it uses 
the response y to construct its directions, its solution path is a nonlinear 
function of y. It can be shown (Exercise 3.15) that partial least squares 
seeks directions that have high variance and have high correlation with the 
response, in contrast to principal components regression which keys only 
on high variance (Stone and Brooks, 1990; Frank and Friedman, 1993). In 
particular, the mth principal component direction $v_m$ solves:

    $\max_\alpha \mathrm{Var}(X\alpha)$    (3.63)
    subject to $\|\alpha\| = 1$, $\alpha^T S v_\ell = 0$, $\ell = 1, \ldots, m-1$,

where S is the sample covariance matrix of the $x_j$. The conditions $\alpha^T S v_\ell = 0$
ensure that $z_m = X\alpha$ is uncorrelated with all the previous linear
combinations $z_\ell = X v_\ell$. The mth PLS direction $\hat\varphi_m$ solves:

    $\max_\alpha \mathrm{Corr}^2(y, X\alpha)\,\mathrm{Var}(X\alpha)$    (3.64)
    subject to $\|\alpha\| = 1$, $\alpha^T S \hat\varphi_\ell = 0$, $\ell = 1, \ldots, m-1$.

Further analysis reveals that the variance aspect tends to dominate, and 
so partial least squares behaves much like ridge regression and principal 
components regression. We discuss this further in the next section. 

If the input matrix X is orthogonal, then partial least squares finds the 
least squares estimates after m = 1 steps. Subsequent steps have no effect
since the $\hat\varphi_{mj}$ are zero for m > 1 (Exercise 3.14). It can also be shown that
the sequence of PLS coefficients for m = 1, 2, ..., p represents the conjugate
gradient sequence for computing the least squares solutions (Exercise 3.18).


3.6 Discussion: A Comparison of the Selection and 
Shrinkage Methods 

There are some simple settings where we can understand better the
relationship between the different methods described above. Consider an
example with two correlated inputs $X_1$ and $X_2$, with correlation $\rho$. We assume
that the true regression coefficients are $\beta_1 = 4$ and $\beta_2 = 2$. Figure 3.18
shows the coefficient profiles for the different methods, as their tuning
parameters are varied. The top panel has $\rho = 0.5$, the bottom panel $\rho = -0.5$.
The tuning parameters for ridge and lasso vary over a continuous range, 
while best subset, PLS and PCR take just two discrete steps to the least 
squares solution. In the top panel, starting at the origin, ridge regression 
shrinks the coefficients together until it finally converges to least squares. 
PLS and PCR show similar behavior to ridge, although they are discrete and
more extreme. Best subset overshoots the solution and then backtracks.
The behavior of the lasso is intermediate to the other methods. When the 
correlation is negative (lower panel), again PLS and PCR roughly track 
the ridge path, while all of the methods are more similar to one another. 

It is interesting to compare the shrinkage behavior of these different 
methods. Recall that ridge regression shrinks all directions, but shrinks 
low-variance directions more. Principal components regression leaves M 
high-variance directions alone, and discards the rest. Interestingly, it can 
be shown that partial least squares also tends to shrink the low-variance 
directions, but can actually inflate some of the higher variance directions. 
This can make PLS a little unstable, and cause it to have slightly higher 
prediction error compared to ridge regression. A full study is given in Frank 
and Friedman (1993). These authors conclude that for minimizing prediction
error, ridge regression is generally preferable to variable subset selection,
principal components regression and partial least squares. However
the improvement over the latter two methods was only slight.

To summarize, PLS, PCR and ridge regression tend to behave similarly. 
Ridge regression may be preferred because it shrinks smoothly, rather than 
in discrete steps. Lasso falls somewhere between ridge regression and best 
subset regression, and enjoys some of the properties of each.

FIGURE 3.18. Coefficient profiles from different methods for a simple problem:
two inputs with correlation ±0.5, and the true regression coefficients $\beta = (4, 2)$.


3.7 Multiple Outcome Shrinkage and Selection

As noted in Section 3.2.4, the least squares estimates in a multiple-output 
linear model are simply the individual least squares estimates for each of 
the outputs. 

To apply selection and shrinkage methods in the multiple output case, 
one could apply a univariate technique individually to each outcome or
simultaneously to all outcomes. With ridge regression, for example, we could
apply formula (3.44) to each of the K columns of the outcome matrix Y,
using possibly different parameters $\lambda$, or apply it to all columns using the
same value of $\lambda$. The former strategy would allow different amounts of
regularization to be applied to different outcomes but require estimation
of k separate regularization parameters $\lambda_1, \ldots, \lambda_k$, while the latter would
permit all k outputs to be used in estimating the sole regularization
parameter $\lambda$.

Other more sophisticated shrinkage and selection strategies that exploit 
correlations in the different responses can be helpful in the multiple output 
case. Suppose for example that among the outputs we have 

    $Y_k = f(X) + \varepsilon_k$    (3.65)

    $Y_\ell = f(X) + \varepsilon_\ell$;    (3.66)

i.e., (3.65) and (3.66) share the same structural part f(X) in their models.
It is clear in this case that we should pool our observations on $Y_k$ and $Y_\ell$
to estimate the common f.

Combining responses is at the heart of canonical correlation analysis 
(CCA), a data reduction technique developed for the multiple output case. 
Similar to PCA, CCA finds a sequence of uncorrelated linear
combinations $X v_m$, m = 1, ..., M of the $x_j$, and a corresponding sequence of
uncorrelated linear combinations $Y u_m$ of the responses $y_k$, such that the
correlations

    $\mathrm{Corr}^2(Y u_m, X v_m)$    (3.67)

are successively maximized. Note that at most M = min(K, p) directions
can be found. The leading canonical response variates are those linear
combinations (derived responses) best predicted by the $x_j$; in contrast, the
trailing canonical variates can be poorly predicted by the $x_j$, and are
candidates for being dropped. The CCA solution is computed using a
generalized SVD of the sample cross-covariance matrix $Y^T X/N$ (assuming Y and
X are centered; Exercise 3.20).
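
One way to carry out this computation is sketched below. Note the substitution: instead of a generalized SVD of $Y^T X/N$, the sketch orthonormalizes X and Y with QR factorizations and takes an ordinary SVD of $Q_Y^T Q_X$, a numerically convenient route to the same canonical correlations. The function name `cca` and the simulated data are assumptions for illustration.

```python
import numpy as np


def cca(X, Y):
    """Canonical correlations and directions for centered X (N x p) and Y (N x K)."""
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    Uq, d, Vqt = np.linalg.svd(Qy.T @ Qx, full_matrices=False)
    U = np.linalg.solve(Ry, Uq)        # response directions u_m (columns)
    V = np.linalg.solve(Rx, Vqt.T)     # input directions v_m (columns)
    return d, U, V


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X @ rng.normal(size=(4, 3)) + rng.normal(size=(100, 3))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

d, U, V = cca(X, Y)
print(d)                                             # canonical correlations
print(np.corrcoef(Y @ U[:, 0], X @ V[:, 0])[0, 1])   # correlation of the leading pair
```

Here there are M = min(K, p) = 3 canonical pairs, returned in decreasing order of correlation, so the trailing columns of U identify the response combinations that are hardest to predict.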

Reduced-rank regression (Izenman, 1975; van der Merwe and Zidek, 1980) 
formalizes this approach in terms of a regression model that explicitly pools 
information. Given an error covariance $\mathrm{Cov}(\varepsilon) = \Sigma$, we solve the following
restricted multivariate regression problem:

    $\hat{B}^{rr}(m) = \operatorname*{argmin}_{\mathrm{rank}(B)=m} \sum_{i=1}^{N} (y_i - B^T x_i)^T \Sigma^{-1} (y_i - B^T x_i).$    (3.68)

With $\Sigma$ replaced by the estimate $Y^T Y/N$, one can show (Exercise 3.21)
that the solution is given by a CCA of Y and X:

    $\hat{B}^{rr}(m) = \hat{B} U_m U_m^{-},$    (3.69)

where $U_m$ is the K × m sub-matrix of U consisting of the first m columns,
and U is the K × M matrix of left canonical vectors $u_1, u_2, \ldots, u_M$. $U_m^{-}$
is its generalized inverse. Writing the solution as

    $\hat{B}^{rr}(m) = (X^T X)^{-1} X^T (Y U_m) U_m^{-},$    (3.70)

we see that reduced-rank regression performs a linear regression on the
pooled response matrix $Y U_m$, and then maps the coefficients (and hence
the fits as well) back to the original response space. The reduced-rank fits
are given by

    $\hat{Y}^{rr}(m) = X(X^T X)^{-1} X^T Y U_m U_m^{-} = H Y P_m,$    (3.71)

where H is the usual linear regression projection operator, and $P_m$ is the
rank-m CCA response projection operator. Although a better estimate of
$\Sigma$ would be $(Y - X\hat{B})^T (Y - X\hat{B})/(N - pK)$, one can show that the solution
remains the same (Exercise 3.22).

Reduced-rank regression borrows strength among responses by truncating
the CCA. Breiman and Friedman (1997) explored with some success
shrinkage of the canonical variates between X and Y, a smooth version of
reduced rank regression. Their proposal has the form (compare (3.69))

    $\hat{B}^{c+w} = \hat{B} U \Lambda U^{-1},$    (3.72)

where $\Lambda$ is a diagonal shrinkage matrix (the “c+w” stands for “Curds
and Whey,” the name they gave to their procedure). Based on optimal
prediction in the population setting, they show that $\Lambda$ has diagonal entries

    $\lambda_m = \frac{c_m^2}{c_m^2 + \frac{p}{N}(1 - c_m^2)}, \quad m = 1, \ldots, M,$    (3.73)

where $c_m$ is the mth canonical correlation coefficient. Note that as the ratio
of the number of input variables to sample size p/N gets small, the
shrinkage factors approach 1. Breiman and Friedman (1997) proposed modified
versions of $\Lambda$ based on training data and cross-validation, but the general
form is the same. Here the fitted response has the form

    $\hat{Y}^{c+w} = H Y S^{c+w},$    (3.74)

where $S^{c+w} = U \Lambda U^{-1}$ is the response shrinkage operator.

Breiman and Friedman (1997) also suggested shrinking in both the Y
space and X space. This leads to hybrid shrinkage models of the form

    $\hat{Y}^{ridge,c+w} = A_\lambda Y S^{c+w},$    (3.75)

where $A_\lambda = X(X^T X + \lambda I)^{-1} X^T$ is the ridge regression shrinkage operator,
as in (3.46) on page 66. Their paper and the discussions thereof contain
many more details.
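
As a small worked illustration of the population shrinkage factors (3.73), the sketch below evaluates $\lambda_m$ for a few canonical correlations at two sample sizes. The function name `cw_shrinkage` and the chosen values of c, p and N are assumptions made for the example.

```python
def cw_shrinkage(c, p, N):
    """Population curds-and-whey factor (3.73) for canonical correlation c."""
    c2 = c * c
    return c2 / (c2 + (p / N) * (1.0 - c2))


# Hypothetical canonical correlations, p = 10 inputs, two sample sizes.
for N in (50, 5000):
    factors = [cw_shrinkage(c, p=10, N=N) for c in (0.9, 0.5, 0.2)]
    print(N, [round(f, 3) for f in factors])
```

Weakly correlated canonical variates are shrunk hardest, and all factors move toward 1 as p/N becomes small.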


3.8 More on the Lasso and Related Path 
Algorithms 

Since the publication of the LAR algorithm (Efron et al., 2004) there has
been a lot of activity in developing algorithms for fitting regularization
paths for a variety of different problems. In addition, $L_1$ regularization has
taken on a life of its own, leading to the development of the field compressed
sensing in the signal-processing literature (Donoho, 2006a; Candes, 2006).
In this section we discuss some related proposals and other path algorithms, 
starting off with a precursor to the LAR algorithm. 

3.8.1 Incremental Forward Stagewise Regression 

Here we present another LAR-like algorithm, this time focused on forward 
stagewise regression. Interestingly, efforts to understand a flexible nonlinear 
regression procedure (boosting) led to a new algorithm for linear models 
(LAR). In reading the first edition of this book and the forward stagewise 


Algorithm 3.4 Incremental Forward Stagewise Regression: $FS_\epsilon$.

1. Start with the residual r equal to y and $\beta_1, \beta_2, \ldots, \beta_p = 0$. All the
predictors are standardized to have mean zero and unit norm.

2. Find the predictor $x_j$ most correlated with r.

3. Update $\beta_j \leftarrow \beta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}[\langle x_j, r\rangle]$ and $\epsilon > 0$ is a small
step size, and set $r \leftarrow r - \delta_j x_j$.

4. Repeat steps 2 and 3 many times, until the residuals are uncorrelated
with all the predictors.


Algorithm 16.1 of Chapter 16⁴, our colleague Brad Efron realized that with


⁴ In the first edition, this was Algorithm 10.4 in Chapter 10.

FIGURE 3.19. Coefficient profiles for the prostate data. The left panel ($FS_\epsilon$) shows
incremental forward stagewise regression with step size $\epsilon = 0.01$. The right panel
($FS_0$) shows the infinitesimal version $FS_0$ obtained letting $\epsilon \to 0$. This profile was fit by
the modification 3.2b to the LAR Algorithm 3.2. In this example the $FS_0$ profiles
are monotone, and hence identical to those of lasso and LAR.


linear models, one could explicitly construct the piecewise-linear lasso paths 
of Figure 3.10. This led him to propose the LAR procedure of Section 3.4.4, 
as well as the incremental version of forward-stagewise regression presented 
here. 

Consider the linear-regression version of the forward-stagewise boosting 
algorithm 16.1 proposed in Section 16.1 (page 608). It generates a coefficient 
profile by repeatedly updating (by a small amount $\epsilon$) the coefficient of the
variable most correlated with the current residuals. Algorithm 3.4 gives
the details. Figure 3.19 (left panel) shows the progress of the algorithm on
the prostate data with step size $\epsilon = 0.01$. If $\delta_j = \langle x_j, r\rangle$ (the least-squares
coefficient of the residual on jth predictor), then this is exactly the usual
forward stagewise procedure (FS) outlined in Section 3.3.3.
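
Algorithm 3.4 is short enough to sketch directly. In the illustration below the function name `forward_stagewise`, the simulated data, the cap `max_steps`, and the practical stopping rule (stop once the largest absolute inner product falls below the step size) are assumptions; the algorithm itself only asks to repeat "many times".

```python
import numpy as np


def forward_stagewise(X, y, eps=0.01, max_steps=100000):
    """Algorithm 3.4 on unit-norm, centered predictors and a centered response."""
    beta = np.zeros(X.shape[1])
    r = y.copy()
    for _ in range(max_steps):
        c = X.T @ r                      # inner products with the residual
        j = int(np.argmax(np.abs(c)))    # step 2: most correlated predictor
        if abs(c[j]) < eps:              # practical stand-in for "uncorrelated"
            break
        delta = eps * np.sign(c[j])      # step 3: small signed update
        beta[j] += delta
        r = r - delta * X[:, j]
    return beta


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + 0.1 * rng.normal(size=50)
y = y - y.mean()

beta_fs = forward_stagewise(X, y, eps=0.01)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_fs, beta_ols)
```

Run to completion, the procedure ends close to the least squares fit; recording `beta` after every update instead of only at the end would trace out a coefficient profile like the left panel of Figure 3.19.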

Here we are mainly interested in small values of $\epsilon$. Letting $\epsilon \to 0$ gives
the right panel of Figure 3.19, which in this case is identical to the lasso
path in Figure 3.10. We call this limiting procedure infinitesimal forward
stagewise regression or $FS_0$. This procedure plays an important role in
non-linear, adaptive methods like boosting (Chapters 10 and 16) and is the
version of incremental forward stagewise regression that is most amenable
to theoretical analysis. Bühlmann and Hothorn (2007) refer to the same
procedure as “L2boost”, because of its connections to boosting.

Efron originally thought that the LAR Algorithm 3.2 was an implementation
of $FS_0$, allowing each tied predictor a chance to update their coefficients
in a balanced way, while remaining tied in correlation. However, he
then realized that the LAR least-squares fit amongst the tied predictors
can result in coefficients moving in the opposite direction to their
correlation, which cannot happen in Algorithm 3.4. The following modification of
the LAR algorithm implements $FS_0$:


Algorithm 3.2b Least Angle Regression: $FS_0$ Modification.

4. Find the new direction by solving the constrained least squares problem

    $\min_b \|r - X_{\mathcal{A}} b\|_2^2$ subject to $b_j s_j \ge 0$, $j \in \mathcal{A}$,

where $s_j$ is the sign of $\langle x_j, r\rangle$.


The modification amounts to a non-negative least squares fit, keeping the 
signs of the coefficients the same as those of the correlations. One can show 
that this achieves the optimal balancing of infinitesimal “update turns” 
for the variables tied for maximal correlation (Hastie et al., 2007). Like
lasso, the entire $FS_0$ path can be computed very efficiently via the LAR
algorithm.

As a consequence of these results, if the LAR profiles are monotone
non-increasing or non-decreasing, as they are in Figure 3.19, then all three
methods (LAR, lasso, and $FS_0$) give identical profiles. If the profiles are
not monotone but do not cross the zero axis, then LAR and lasso are 
identical. 

Since FSo is different from the lasso, it is natural to ask if it optimizes 
a criterion. The answer is more complex than for lasso; the FSo coefficient 
profile is the solution to a differential equation. While the lasso makes op¬ 
timal progress in terms of reducing the residual sum-of-squares per unit 
increase in Lj-norm of the coefficient vector /?, FSo is optimal per unit 
increase in L\ arc-length traveled along the coefficient path. Hence its co¬ 
efficient path is discouraged from changing directions too often. 

FS$_0$ is more constrained than lasso, and in fact can be viewed as a
monotone version of the lasso; see Figure 16.3 on page 614 for a dramatic
example. FS$_0$ may be useful in $p \gg N$ situations, where its coefficient
profiles are much smoother and hence have less variance than those of lasso.
More details on FS$_0$ are given in Section 16.2.3 and Hastie et al. (2007).
Figure 3.16 includes FS$_0$ where its performance is very similar to that of
the lasso.
3.8.2 Piecewise-Linear Path Algorithms

The least angle regression procedure exploits the piecewise linear nature of 
the lasso solution paths. It has led to similar “path algorithms” for other 
regularized problems. Suppose we solve 

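$$\hat\beta(\lambda) = \operatorname{argmin}_\beta \, [R(\beta) + \lambda J(\beta)], \tag{3.76}$$

with

$$R(\beta) = \sum_{i=1}^{N} L\Big(y_i,\ \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j\Big), \tag{3.77}$$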

where both the loss function $L$ and the penalty function $J$ are convex.
Then the following are sufficient conditions for the solution path
$\hat\beta(\lambda)$ to be piecewise linear (Rosset and Zhu, 2007):

1. $R$ is quadratic or piecewise-quadratic as a function of $\beta$, and

2. $J$ is piecewise linear in $\beta$.

This also implies (in principle) that the solution path can be efficiently
computed. Examples include squared- and absolute-error loss, “Huberized”
losses, and the $L_1$, $L_\infty$ penalties on $\beta$. Another example is the
“hinge loss” function used in the support vector machine. There the loss is
piecewise linear, and the penalty is quadratic. Interestingly, this leads to
a piecewise-linear path algorithm in the dual space; more details are given
in Section 12.3.5.

3.8.3 The Dantzig Selector 

Candes and Tao (2007) proposed the following criterion: 

$$\min_\beta \|\beta\|_1 \quad \text{subject to } \|X^T(y - X\beta)\|_\infty \le s. \tag{3.78}$$

They call the solution the Dantzig selector (DS). It can be written
equivalently as

$$\min_\beta \|X^T(y - X\beta)\|_\infty \quad \text{subject to } \|\beta\|_1 \le t. \tag{3.79}$$

Here $\|\cdot\|_\infty$ denotes the $L_\infty$ norm, the maximum absolute value of
the components of the vector. In this form it resembles the lasso, replacing
squared error loss by the maximum absolute value of its gradient. Note
that as $t$ gets large, both procedures yield the least squares solution if
$p < N$. If $p \ge N$, they both yield the least squares solution with minimum
$L_1$ norm. However for smaller values of $t$, the DS procedure produces a
different path of solutions than the lasso.

Candes and Tao (2007) show that the solution to DS is a linear
programming problem; hence the name Dantzig selector, in honor of the late
George Dantzig, the inventor of the simplex method for linear programming.
They also prove a number of interesting mathematical properties for
the method, related to its ability to recover an underlying sparse
coefficient vector. These same properties also hold for the lasso, as shown
later by Bickel et al. (2008).
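
To make the linear-programming claim concrete, here is one standard way to
write (3.78) as a linear program. The auxiliary variables $u_j$ are introduced
here purely for illustration and are not notation from the text.

```latex
\begin{aligned}
\min_{\beta,\,u}\quad & \sum_{j=1}^{p} u_j \\
\text{subject to}\quad
  & -u_j \le \beta_j \le u_j, && j = 1,\dots,p, \\
  & -s \le x_j^T (y - X\beta) \le s, && j = 1,\dots,p.
\end{aligned}
```

At the optimum $u_j = |\beta_j|$, so the objective equals $\|\beta\|_1$, and the
second set of constraints is $\|X^T(y - X\beta)\|_\infty \le s$ written out
componentwise. Both the objective and the constraints are linear in
$(\beta, u)$.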

Unfortunately the operating properties of the DS method are somewhat 
unsatisfactory. The method seems similar in spirit to the lasso, especially 
when we look at the lasso’s stationary conditions (3.58). Like the LAR
algorithm, the lasso maintains the same inner product (and correlation) with
the current residual for all variables in the active set, and moves their
coefficients to optimally decrease the residual sum of squares. In the process,
this common correlation is decreased monotonically (Exercise 3.23), and at 
all times this correlation is larger than that for non-active variables. The 
Dantzig selector instead tries to minimize the maximum inner product of 
the current residual with all the predictors. Hence it can achieve a smaller 
maximum than the lasso, but in the process a curious phenomenon can 
occur. If the size of the active set is m, there will be m variables tied with 
maximum correlation. However, these need not coincide with the active set! 
Hence it can include a variable in the model that has smaller correlation 
with the current residual than some of the excluded variables (Efron et 
al., 2007). This seems unreasonable and may be responsible for its sometimes
inferior prediction accuracy. Efron et al. (2007) also show that DS
can yield extremely erratic coefficient paths as the regularization parameter 
s is varied. 


3.8.4 The Grouped Lasso 

In some problems, the predictors belong to pre-defined groups; for example 
genes that belong to the same biological pathway, or collections of indicator 
(dummy) variables for representing the levels of a categorical predictor. In 
this situation it may be desirable to shrink and select the members of a 
group together. The grouped lasso is one way to achieve this. Suppose that 
the $p$ predictors are divided into $L$ groups, with $p_\ell$ the number in
group $\ell$. For ease of notation, we use a matrix $X_\ell$ to represent the
predictors corresponding to the $\ell$th group, with corresponding coefficient
vector $\beta_\ell$. The grouped-lasso minimizes the convex criterion


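$$\min_{\beta \in \mathbb{R}^p} \left( \Big\| y - \beta_0 \mathbf{1} - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \|\beta_\ell\|_2 \right), \tag{3.80}$$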


where the $\sqrt{p_\ell}$ terms account for the varying group sizes, and
$\|\cdot\|_2$ is the Euclidean norm (not squared). Since the Euclidean norm of a
vector $\beta_\ell$ is zero only if all of its components are zero, this
procedure encourages sparsity at both the group and individual levels. That
is, for some values of $\lambda$, an entire group of predictors may drop out of
the model. This procedure was proposed by Bakin (1999) and Lin and Zhang
(2006), and studied and generalized by Yuan and Lin (2007). Generalizations
include more general $L_2$ norms $\|\eta\|_K = (\eta^T K \eta)^{1/2}$, as well as
allowing overlapping groups of predictors (Zhao et al., 2008). There are also
connections to methods for fitting sparse additive models (Lin and Zhang,
2006; Ravikumar et al., 2008).

3.8.5 Further Properties of the Lasso 

A number of authors have studied the ability of the lasso and related
procedures to recover the correct model, as $N$ and $p$ grow. Examples of this
work include Knight and Fu (2000), Greenshtein and Ritov (2004), Tropp
(2004), Donoho (2006b), Meinshausen (2007), Meinshausen and Bühlmann
(2006), Tropp (2006), Zhao and Yu (2006), Wainwright (2006), and Bunea
et al. (2007). For example Donoho (2006b) focuses on the $p > N$ case and
considers the lasso solution as the bound $t$ gets large. In the limit this
gives the solution with minimum $L_1$ norm among all models with zero training
error. He shows that under certain assumptions on the model matrix $X$, if
the true model is sparse, this solution identifies the correct predictors with
high probability.

Many of the results in this area assume a condition on the model matrix 
of the form 

$$\max_{j \in S^c} \|x_j^T X_S (X_S^T X_S)^{-1}\|_1 \le (1 - \epsilon) \quad \text{for some } \epsilon \in (0, 1]. \tag{3.81}$$

Here $S$ indexes the subset of features with non-zero coefficients in the true
underlying model, and $X_S$ are the columns of $X$ corresponding to those
features. Similarly $S^c$ are the features with true coefficients equal to
zero, and $X_{S^c}$ the corresponding columns. This says that the least squares
coefficients for the columns of $X_{S^c}$ on $X_S$ are not too large, that is,
the “good” variables $S$ are not too highly correlated with the nuisance
variables $S^c$.

Regarding the coefficients themselves, the lasso shrinkage causes the
estimates of the non-zero coefficients to be biased towards zero, and in
general they are not consistent.⁵ One approach for reducing this bias is to run
the lasso to identify the set of non-zero coefficients, and then fit an
unrestricted linear model to the selected set of features. This is not always
feasible, if the selected set is large. Alternatively, one can use the lasso to
select the set of non-zero predictors, and then apply the lasso again, but
using only the selected predictors from the first step. This is known as the
relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to
estimate the initial penalty parameter for the lasso, and then again for a
second penalty parameter applied to the selected set of predictors. Since
the variables in the second step have less “competition” from noise
variables, cross-validation will tend to pick a smaller value for $\lambda$, and
hence their coefficients will be shrunken less than those in the initial
estimate.

⁵ Statistical consistency means that, as the sample size grows, the estimates
converge to the true values.

Alternatively, one can modify the lasso penalty function so that larger
coefficients are shrunken less severely; the smoothly clipped absolute
deviation (SCAD) penalty of Fan and Li (2005) replaces $\lambda|\beta|$ by
$J_a(\beta, \lambda)$, where


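$$\frac{dJ_a(\beta, \lambda)}{d\beta} = \lambda \cdot \operatorname{sign}(\beta) \left[ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a - 1)\lambda}\, I(|\beta| > \lambda) \right] \tag{3.82}$$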


for some $a \ge 2$. The second term in square-braces reduces the amount of
shrinkage in the lasso for larger values of $\beta$, with ultimately no
shrinkage as $a \to \infty$. Figure 3.20 shows the SCAD penalty, along with the
lasso and $|\beta|^{1-\nu}$. However this criterion is non-convex, which is a
drawback since it makes the computation much more difficult. The adaptive
lasso (Zou, 2006) uses a weighted penalty of the form
$\sum_{j=1}^{p} w_j |\beta_j|$ where $w_j = 1/|\hat\beta_j|^\nu$, $\hat\beta_j$ is
the ordinary least squares estimate and $\nu > 0$. This is a practical
approximation to the $|\beta|^q$ penalties ($q = 1 - \nu$ here) discussed in
Section 3.4.3. The adaptive lasso yields consistent estimates of the
parameters while retaining the attractive convexity property of the lasso.

[Figure 3.20: three panels, titled $|\beta|$, SCAD and $|\beta|^{1-\nu}$, each
plotting the penalty against $\beta$.]

FIGURE 3.20. The lasso and two alternative non-convex penalties designed to
penalize large coefficients less. For SCAD we use $\lambda = 1$ and $a = 4$, and
$\nu = \frac{1}{2}$ in the last panel.
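
To see how the bracketed term in (3.82) tapers the shrinkage, the following
minimal Python sketch evaluates the SCAD derivative next to the constant lasso
derivative. The function names and the evaluation points are made up for
illustration; only the formula itself comes from (3.82).

```python
def sign(beta):
    """Return -1, 0 or 1 according to the sign of beta."""
    return (beta > 0) - (beta < 0)


def lasso_derivative(beta, lam):
    """Derivative of lam * |beta| away from zero: a constant rate lam."""
    return lam * sign(beta)


def scad_derivative(beta, lam, a):
    """Derivative of the SCAD penalty J_a(beta, lam), following (3.82)."""
    if a < 2:
        raise ValueError("the text assumes a >= 2")
    abs_beta = abs(beta)
    if abs_beta <= lam:
        bracket = 1.0  # same rate as the lasso for small coefficients
    else:
        # positive part (a*lam - |beta|)_+ tapers the rate down to zero
        bracket = max(a * lam - abs_beta, 0.0) / ((a - 1) * lam)
    return lam * sign(beta) * bracket


# Same settings as Figure 3.20: lam = 1 and a = 4.
for beta in (0.5, 1.0, 2.5, 4.0, 6.0):
    print(beta, lasso_derivative(beta, 1.0), scad_derivative(beta, 1.0, 4.0))
```

With these settings the SCAD rate equals the lasso rate up to $|\beta| = \lambda$,
falls linearly after that, and is zero once $|\beta| \ge a\lambda$, which is where
large coefficients stop being shrunken.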

3.8.6 Pathwise Coordinate Optimization 

An alternate approach to the LARS algorithm for computing the lasso
solution is simple coordinate descent. This idea was proposed by Fu (1998)
and Daubechies et al. (2004), and later studied and generalized by Friedman
et al. (2007), Wu and Lange (2008) and others. The idea is to fix the penalty
parameter $\lambda$ in the Lagrangian form (3.52) and optimize successively over
each parameter, holding the other parameters fixed at their current values.

Suppose the predictors are all standardized to have mean zero and unit
norm. Denote by $\tilde\beta_k(\lambda)$ the current estimate for $\beta_k$ at
penalty parameter $\lambda$. We can rearrange (3.52) to isolate $\beta_j$,

$$R(\tilde\beta(\lambda), \beta_j) = \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{k \ne j} x_{ik} \tilde\beta_k(\lambda) - x_{ij}\beta_j \Big)^2 + \lambda \sum_{k \ne j} |\tilde\beta_k(\lambda)| + \lambda |\beta_j|, \tag{3.83}$$

where we have suppressed the intercept and introduced a factor $\frac{1}{2}$ for
convenience. This can be viewed as a univariate lasso problem with response
variable the partial residual
$y_i - \tilde y_i^{(j)} = y_i - \sum_{k \ne j} x_{ik} \tilde\beta_k(\lambda)$. This has
an explicit solution, resulting in the update

$$\tilde\beta_j(\lambda) \leftarrow S\Big( \sum_{i=1}^{N} x_{ij} (y_i - \tilde y_i^{(j)}),\ \lambda \Big). \tag{3.84}$$

Here $S(t, \lambda) = \operatorname{sign}(t)(|t| - \lambda)_+$ is the
soft-thresholding operator in Table 3.4 on page 71. The first argument to
$S(\cdot)$ is the simple least-squares coefficient of the partial residual on
the standardized variable $x_{ij}$. Repeated iteration of (3.84), cycling
through each variable in turn until convergence, yields the lasso estimate
$\hat\beta(\lambda)$.

We can also use this simple algorithm to efficiently compute the lasso
solutions at a grid of values of $\lambda$. We start with the smallest value
$\lambda_{\max}$ for which $\hat\beta(\lambda_{\max}) = 0$, decrease it a little and
cycle through the variables until convergence. Then $\lambda$ is decreased again
and the process is repeated, using the previous solution as a “warm start”
for the new value of $\lambda$. This can be faster than the LARS algorithm,
especially in large problems. A key to its speed is the fact that the
quantities in (3.84) can be updated quickly as $j$ varies, and often the update
is to leave $\tilde\beta_j = 0$. On the other hand, it delivers solutions over a
grid of $\lambda$ values, rather than the entire solution path. The same kind of
algorithm can be applied to the elastic net, the grouped lasso and many other
models in which the penalty is a sum of functions of the individual parameters
(Friedman et al., 2010). It can also be applied, with some substantial
modifications, to the fused lasso (Section 18.4.2); details are in Friedman et
al. (2007).
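
The update (3.84) and the warm-start strategy can be sketched in a few lines
of plain Python. This is a minimal illustration under the assumptions stated
in the text (centered, unit-norm columns and no intercept), not a tuned
implementation; the function names and the toy data are made up.

```python
def soft_threshold(t, lam):
    """Soft-thresholding operator S(t, lam) = sign(t) * (|t| - lam)_+."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0


def lasso_cd(X, y, lam, beta=None, tol=1e-10, max_sweeps=1000):
    """Cyclic coordinate descent for
        (1/2) * sum_i (y_i - sum_j x_ij * b_j)^2 + lam * sum_j |b_j|.

    Assumes every column of X has mean zero and unit norm, and that y is
    centered. X is a list of rows; beta is an optional warm start.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    # Full residual y - X beta, kept up to date after every coordinate move.
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    for _ in range(max_sweeps):
        largest_move = 0.0
        for j in range(p):
            # Inner product of x_j with the partial residual; because
            # ||x_j|| = 1 this equals <x_j, resid> + beta_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + beta[j]
            new_bj = soft_threshold(rho, lam)  # update (3.84)
            move = new_bj - beta[j]
            if move != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * move
                beta[j] = new_bj
                largest_move = max(largest_move, abs(move))
        if largest_move < tol:
            break
    return beta


def lasso_path(X, y, lambdas):
    """Solve along a decreasing grid of lam values, warm starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta)
        path.append((lam, list(beta)))
    return path


# Toy data (made up): two centered, unit-norm, mutually orthogonal columns,
# so each lasso coefficient is the soft-thresholded least squares coefficient.
X = [[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]]
y = [3.0, -1.0, 1.0, -3.0]
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(4))) for j in range(2))
path = lasso_path(X, y, [lam_max, 2.5, 1.0])
```

Keeping the full residual up to date is what makes each coordinate update
cheap: one inner product per variable, and no work on the residual at all when
a coefficient stays at zero.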


3.9 Computational Considerations 

Least squares fitting is usually done via the Cholesky decomposition of
the matrix $X^T X$ or a QR decomposition of $X$. With $N$ observations and $p$
features, the Cholesky decomposition requires $p^3 + Np^2/2$ operations, while
the QR decomposition requires $Np^2$ operations. Depending on the relative
size of $N$ and $p$, the Cholesky can sometimes be faster; on the other hand,
it can be less numerically stable (Lawson and Hansen, 1974). Computation
of the lasso via the LAR algorithm has the same order of computation as
a least squares fit.
Bibliographic Notes

Linear regression is discussed in many statistics books, for example, Seber
(1984), Weisberg (1980) and Mardia et al. (1979). Ridge regression was
introduced by Hoerl and Kennard (1970), while the lasso was proposed by
Tibshirani (1996). Around the same time, lasso-type penalties were proposed
in the basis pursuit method for signal processing (Chen et al., 1998).
The least angle regression procedure was proposed in Efron et al. (2004);
related to this is the earlier homotopy procedure of Osborne et al. (2000a)
and Osborne et al. (2000b). Their algorithm also exploits the piecewise
linearity used in the LAR/lasso algorithm, but lacks its transparency. The
criterion optimized by forward stagewise regression is discussed in Hastie
et al. (2007). Park and Hastie (2007) develop a path algorithm similar to
least angle regression for generalized regression models. Partial least
squares was introduced by Wold (1975). Comparisons of shrinkage methods may
be found in Copas (1983) and Frank and Friedman (1993).


Exercises 


Ex. 3.1 Show that the F statistic (3.13) for dropping a single coefficient
from a model is equal to the square of the corresponding z-score (3.12).

Ex. 3.2 Given data on two variables $X$ and $Y$, consider fitting a cubic
polynomial regression model $f(X) = \sum_{j=0}^{3} \beta_j X^j$. In addition to
plotting the fitted curve, you would like a 95% confidence band about the
curve. Consider the following two approaches:

1. At each point $x_0$, form a 95% confidence interval for the linear
function $a^T\beta = \sum_{j=0}^{3} \beta_j x_0^j$.

2. Form a 95% confidence set for $\beta$ as in (3.15), which in turn generates
confidence intervals for $f(x_0)$.

How do these approaches differ? Which band is likely to be wider? Conduct 
a small simulation experiment to compare the two methods. 

Ex. 3.3 Gauss-Markov theorem: 

(a) Prove the Gauss-Markov theorem: the least squares estimate of a 
parameter a T /3 has variance no bigger than that of any other linear 
unbiased estimate of a T /3 (Section 3.2.2). 

(b) The matrix inequality $B \preceq A$ holds if $A - B$ is positive
semidefinite. Show that if $\hat V$ is the variance-covariance matrix of the
least squares estimate of $\beta$ and $\tilde V$ is the variance-covariance
matrix of any other linear unbiased estimate, then $\hat V \preceq \tilde V$.



Ex. 3.4 Show how the vector of least squares coefficients can be obtained
from a single pass of the Gram-Schmidt procedure (Algorithm 3.1). Represent
your solution in terms of the QR decomposition of $X$.

Ex. 3.5 Consider the ridge regression problem (3.41). Show that this problem
is equivalent to the problem

$$\hat\beta^c = \operatorname{argmin}_{\beta^c} \left\{ \sum_{i=1}^{N} \Big[ y_i - \beta_0^c - \sum_{j=1}^{p} (x_{ij} - \bar x_j) \beta_j^c \Big]^2 + \lambda \sum_{j=1}^{p} (\beta_j^c)^2 \right\}. \tag{3.85}$$

Give the correspondence between $\beta^c$ and the original $\beta$ in (3.41).
Characterize the solution to this modified criterion. Show that a similar
result holds for the lasso.

Ex. 3.6 Show that the ridge regression estimate is the mean (and mode)
of the posterior distribution, under a Gaussian prior
$\beta \sim N(0, \tau I)$, and Gaussian sampling model
$y \sim N(X\beta, \sigma^2 I)$. Find the relationship between the
regularization parameter $\lambda$ in the ridge formula, and the variances
$\tau$ and $\sigma^2$.

Ex. 3.7 Assume $y_i \sim N(\beta_0 + x_i^T\beta, \sigma^2)$, $i = 1, 2, \ldots, N$,
and the parameters $\beta_j$, $j = 1, \ldots, p$ are each distributed as
$N(0, \tau^2)$, independently of one another. Assuming $\sigma^2$ and $\tau^2$ are
known, and $\beta_0$ is not governed by a prior (or has a flat improper prior),
show that the (minus) log-posterior density of $\beta$ is proportional to
$\sum_{i=1}^{N} (y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
where $\lambda = \sigma^2/\tau^2$.

Ex. 3.8 Consider the QR decomposition of the uncentered $N \times (p + 1)$
matrix $X$ (whose first column is all ones), and the SVD of the $N \times p$
centered matrix $\tilde X$. Show that $Q_2$ and $U$ span the same subspace, where
$Q_2$ is the sub-matrix of $Q$ with the first column removed. Under what
circumstances will they be the same, up to sign flips?

Ex. 3.9 Forward stepwise regression. Suppose we have the QR decomposition
for the $N \times q$ matrix $X_1$ in a multiple regression problem with response
$y$, and we have an additional $p - q$ predictors in the matrix $X_2$. Denote
the current residual by $r$. We wish to establish which one of these additional
variables will reduce the residual sum-of-squares the most when included
with those in $X_1$. Describe an efficient procedure for doing this.

Ex. 3.10 Backward stepwise regression. Suppose we have the multiple
regression fit of y on X, along with the standard errors and Z-scores as in
Table 3.2. We wish to establish which variable, when dropped, will increase 
the residual sum-of-squares the least. How would you do this? 

Ex. 3.11 Show that the solution to the multivariate linear regression
problem (3.40) is given by (3.39). What happens if the covariance matrices
$\Sigma_i$ are different for each observation?
Ex. 3.12 Show that the ridge regression estimates can be obtained by
ordinary least squares regression on an augmented data set. We augment
the centered matrix $X$ with $p$ additional rows $\sqrt{\lambda} I$, and augment
$y$ with $p$ zeros. By introducing artificial data having response value zero,
the fitting procedure is forced to shrink the coefficients toward zero. This
is related to the idea of hints due to Abu-Mostafa (1995), where model
constraints are implemented by adding artificial data examples that satisfy
them.

Ex. 3.13 Derive the expression (3.62), and show that $\hat\beta^{\text{pcr}}(p) = \hat\beta^{\text{ls}}$.

Ex. 3.14 Show that in the orthogonal case, PLS stops after $m = 1$ steps,
because subsequent $\hat\varphi_{mj}$ in step 2 in Algorithm 3.3 are zero.

Ex. 3.15 Verify expression (3.64), and hence show that the partial least
squares directions are a compromise between the ordinary regression
coefficient and the principal component directions.

Ex. 3.16 Derive the entries in Table 3.4, the explicit forms for estimators 
in the orthogonal case. 

Ex. 3.17 Repeat the analysis of Table 3.3 on the spam data discussed in 
Chapter 1. 

Ex. 3.18 Read about conjugate gradient algorithms (Murray et al., 1981, for 
example), and establish a connection between these algorithms and partial 
least squares. 

Ex. 3.19 Show that $\|\hat\beta^{\text{ridge}}\|$ increases as its tuning parameter $\lambda \to 0$. Does
the same property hold for the lasso and partial least squares estimates? 
For the latter, consider the “tuning parameter” to be the successive steps 
in the algorithm. 

Ex. 3.20 Consider the canonical-correlation problem (3.67). Show that the
leading pair of canonical variates $u_1$ and $v_1$ solve the problem

$$\max_{\substack{u^T (Y^T Y) u = 1 \\ v^T (X^T X) v = 1}} u^T (Y^T X) v, \tag{3.86}$$

a generalized SVD problem. Show that the solution is given by
$u_1 = (Y^T Y)^{-1/2} u_1^*$ and $v_1 = (X^T X)^{-1/2} v_1^*$, where $u_1^*$ and
$v_1^*$ are the leading left and right singular vectors in

$$(Y^T Y)^{-1/2} (Y^T X) (X^T X)^{-1/2} = U^* D^* V^{*T}. \tag{3.87}$$

Show that the entire sequence $u_m, v_m$, $m = 1, \ldots, \min(K, p)$ is also
given by (3.87).

Ex. 3.21 Show that the solution to the reduced-rank regression problem
(3.68), with $\Sigma$ estimated by $Y^T Y / N$, is given by (3.69). Hint:
Transform $Y$ to $Y^* = Y \Sigma^{-1/2}$, and solve in terms of the canonical
vectors $u_m^*$. Show that $U_m = \Sigma^{-1/2} U_m^*$, and a generalized inverse
is $U_m^- = U_m^{*T} \Sigma^{1/2}$.

Ex. 3.22 Show that the solution in Exercise 3.21 does not change if $\Sigma$ is
estimated by the more natural quantity
$(Y - X\hat B)^T (Y - X\hat B)/(N - pK)$.

Ex. 3.23 Consider a regression problem with all variables and response
having mean zero and standard deviation one. Suppose also that each variable
has identical absolute correlation with the response:

$$\frac{1}{N} |\langle x_j, y \rangle| = \lambda, \quad j = 1, \ldots, p.$$

Let $\hat\beta$ be the least-squares coefficient of $y$ on $X$, and let
$u(\alpha) = \alpha X \hat\beta$ for $\alpha \in [0, 1]$ be the vector that moves a
fraction $\alpha$ toward the least squares fit $u$. Let RSS be the residual
sum-of-squares from the full least squares fit.

(a) Show that

$$\frac{1}{N} |\langle x_j, y - u(\alpha) \rangle| = (1 - \alpha)\lambda, \quad j = 1, \ldots, p,$$

and hence the correlations of each $x_j$ with the residuals remain equal
in magnitude as we progress toward $u$.

(b) Show that these correlations are all equal to

$$\lambda(\alpha) = \frac{(1 - \alpha)}{\sqrt{(1 - \alpha)^2 + \frac{\alpha(2 - \alpha)}{N} \cdot \text{RSS}}} \cdot \lambda,$$

and hence they decrease monotonically to zero.

(c) Use these results to show that the LAR algorithm in Section 3.4.4
keeps the correlations tied and monotonically decreasing, as claimed
in (3.55).

Ex. 3.24 LAR directions. Using the notation around equation (3.55) on
page 74, show that the LAR direction makes an equal angle with each of
the predictors in $\mathcal{A}_k$.

Ex. 3.25 LAR look-ahead (Efron et al., 2004, Sec. 2). Starting at the
beginning of the $k$th step of the LAR algorithm, derive expressions to
identify the next variable to enter the active set at step $k + 1$, and the
value of $\alpha$ at which this occurs (using the notation around equation
(3.55) on page 74).

Ex. 3.26 Forward stepwise regression enters the variable at each step that
most reduces the residual sum-of-squares. LAR adjusts variables that have
the most (absolute) correlation with the current residuals. Show that these
two entry criteria are not necessarily the same. [Hint: let
$x_{j.\mathcal{A}}$ be the $j$th variable, linearly adjusted for all the variables
currently in the model. Show that the first criterion amounts to identifying
the $j$ for which $\text{Cor}(x_{j.\mathcal{A}}, r)$ is largest in magnitude.]

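Ex. 3.27 Lasso and LAR: Consider the lasso problem in Lagrange multiplier
form: with $L(\beta) = \frac{1}{2} \sum_i (y_i - \sum_j x_{ij}\beta_j)^2$, we minimize

$$L(\beta) + \lambda \sum_j |\beta_j| \tag{3.88}$$

for fixed $\lambda > 0$.

(a) Setting $\beta_j = \beta_j^+ - \beta_j^-$ with $\beta_j^+, \beta_j^- \ge 0$,
expression (3.88) becomes $L(\beta) + \lambda \sum_j (\beta_j^+ + \beta_j^-)$. Show
that the Lagrange dual function is

$$L(\beta) + \lambda \sum_j (\beta_j^+ + \beta_j^-) - \sum_j \lambda_j^+ \beta_j^+ - \sum_j \lambda_j^- \beta_j^- \tag{3.89}$$

and the Karush–Kuhn–Tucker optimality conditions are

$$\begin{aligned}
\nabla L(\beta)_j + \lambda - \lambda_j^+ &= 0 \\
-\nabla L(\beta)_j + \lambda - \lambda_j^- &= 0 \\
\lambda_j^+ \beta_j^+ &= 0 \\
\lambda_j^- \beta_j^- &= 0,
\end{aligned}$$

along with the non-negativity constraints on the parameters and all
the Lagrange multipliers.

(b) Show that $|\nabla L(\beta)_j| \le \lambda\ \forall j$, and that the KKT
conditions imply one of the following three scenarios:

$$\begin{aligned}
\lambda = 0 \ &\Rightarrow\ \nabla L(\beta)_j = 0\ \forall j \\
\beta_j^+ > 0,\ \lambda > 0 \ &\Rightarrow\ \lambda_j^+ = 0,\ \nabla L(\beta)_j = -\lambda < 0,\ \beta_j^- = 0 \\
\beta_j^- > 0,\ \lambda > 0 \ &\Rightarrow\ \lambda_j^- = 0,\ \nabla L(\beta)_j = \lambda > 0,\ \beta_j^+ = 0.
\end{aligned}$$

Hence show that for any “active” predictor having $\beta_j \ne 0$, we must
have $\nabla L(\beta)_j = -\lambda$ if $\beta_j > 0$, and
$\nabla L(\beta)_j = \lambda$ if $\beta_j < 0$. Assuming the predictors are
standardized, relate $\lambda$ to the correlation between the $j$th predictor
and the current residuals.

(c) Suppose that the set of active predictors is unchanged for
$\lambda_0 \ge \lambda \ge \lambda_1$. Show that there is a vector $\gamma_0$ such
that

$$\hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda - \lambda_0)\gamma_0 \tag{3.90}$$

Thus the lasso solution path is linear as $\lambda$ ranges from $\lambda_0$ to
$\lambda_1$ (Efron et al., 2004; Rosset and Zhu, 2007).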



Ex. 3.28 Suppose for a given $t$ in (3.51), the fitted lasso coefficient for
variable $X_j$ is $\hat\beta_j = a$. Suppose we augment our set of variables with
an identical copy $X_j^* = X_j$. Characterize the effect of this exact
collinearity by describing the set of solutions for $\hat\beta_j$ and
$\hat\beta_j^*$, using the same value of $t$.

Ex. 3.29 Suppose we run a ridge regression with parameter $\lambda$ on a single
variable $X$, and get coefficient $a$. We now include an exact copy $X^* = X$,
and refit our ridge regression. Show that both coefficients are identical, and
derive their value. Show in general that if $m$ copies of a variable $X_j$ are
included in a ridge regression, their coefficients are all the same.

Ex. 3.30 Consider the elastic-net optimization problem: 

$$\min_\beta \|y - X\beta\|^2 + \lambda \left[ \alpha \|\beta\|_2^2 + (1 - \alpha) \|\beta\|_1 \right]. \tag{3.91}$$

Show how one can turn this into a lasso problem, using an augmented 
version of X and y. 
4

Linear Methods for Classification 




4.1 Introduction 

In this chapter we revisit the classification problem and focus on linear
methods for classification. Since our predictor $G(x)$ takes values in a
discrete set $\mathcal{G}$, we can always divide the input space into a
collection of regions labeled according to the classification. We saw in
Chapter 2 that the boundaries of these regions can be rough or smooth,
depending on the prediction function. For an important class of procedures,
these decision boundaries are linear; this is what we will mean by linear
methods for classification.

There are several different ways in which linear decision boundaries can
be found. In Chapter 2 we fit linear regression models to the class indicator
variables, and classify to the largest fit. Suppose there are $K$ classes, for
convenience labeled $1, 2, \ldots, K$, and the fitted linear model for the $k$th
indicator response variable is
$\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x$. The decision boundary between
class $k$ and $\ell$ is that set of points for which
$\hat f_k(x) = \hat f_\ell(x)$, that is, the set
$\{x : (\hat\beta_{k0} - \hat\beta_{\ell 0}) + (\hat\beta_k - \hat\beta_\ell)^T x = 0\}$,
an affine set or hyperplane.¹ Since the same is true for any pair of classes,
the input space is divided into regions of constant classification, with
piecewise hyperplanar decision boundaries. This regression approach is a
member of a class of methods that model discriminant functions $\delta_k(x)$
for each class, and then classify $x$ to the class with the largest value for
its discriminant function. Methods that model the posterior probabilities
$\Pr(G = k | X = x)$ are also in this class. Clearly, if either the
$\delta_k(x)$ or $\Pr(G = k | X = x)$ are linear in $x$, then the decision
boundaries will be linear.

¹ Strictly speaking, a hyperplane passes through the origin, while an affine
set need not. We sometimes ignore the distinction and refer in general to
hyperplanes.

Actually, all we require is that some monotone transformation of $\delta_k$ or
$\Pr(G = k | X = x)$ be linear for the decision boundaries to be linear. For
example, if there are two classes, a popular model for the posterior
probabilities is

$$\begin{aligned}
\Pr(G = 1 | X = x) &= \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \\
\Pr(G = 2 | X = x) &= \frac{1}{1 + \exp(\beta_0 + \beta^T x)}.
\end{aligned} \tag{4.1}$$

Here the monotone transformation is the logit transformation:
$\log[p/(1 - p)]$, and in fact we see that

$$\log \frac{\Pr(G = 1 | X = x)}{\Pr(G = 2 | X = x)} = \beta_0 + \beta^T x. \tag{4.2}$$

The decision boundary is the set of points for which the log-odds are zero,
and this is a hyperplane defined by $\{x \,|\, \beta_0 + \beta^T x = 0\}$. We
discuss two very popular but different methods that result in linear log-odds
or logits: linear discriminant analysis and linear logistic regression.
Although they differ in their derivation, the essential difference between
them is in the way the linear function is fit to the training data.
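
As a quick numerical check of (4.1) and (4.2), the short Python sketch below
computes the class-1 posterior and confirms that its logit is the linear
function $\beta_0 + \beta^T x$. The coefficient values and the test points are
arbitrary choices for illustration, not estimates from any data.

```python
import math


def posterior_class1(x, beta0, beta):
    """Pr(G = 1 | X = x) under model (4.1)."""
    eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
    # 1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta))
    return 1.0 / (1.0 + math.exp(-eta))


def log_odds(x, beta0, beta):
    """Logit of the class-1 posterior, the left-hand side of (4.2)."""
    p1 = posterior_class1(x, beta0, beta)
    return math.log(p1 / (1.0 - p1))


# Hypothetical coefficients, chosen only for the demonstration.
beta0, beta = -1.0, [2.0, 0.5]

x_boundary = [0.25, 1.0]  # -1 + 2 * 0.25 + 0.5 * 1 = 0, so on the boundary
x_other = [1.0, 2.0]      # -1 + 2 * 1 + 0.5 * 2 = 2

print(posterior_class1(x_boundary, beta0, beta), log_odds(x_boundary, beta0, beta))
print(posterior_class1(x_other, beta0, beta), log_odds(x_other, beta0, beta))
```

Points with zero log-odds have posterior probability one half for each class,
which is why the hyperplane $\beta_0 + \beta^T x = 0$ is the decision boundary.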

A more direct approach is to explicitly model the boundaries between 
the classes as linear. For a two-class problem in a p-dimensional input 
space, this amounts to modeling the decision boundary as a hyperplane—in 
other words, a normal vector and a cut-point. We will look at two methods 
that explicitly look for “separating hyperplanes.” The first is the
well-known perceptron model of Rosenblatt (1958), with an algorithm that finds
a separating hyperplane in the training data, if one exists. The second 
method, due to Vapnik (1996), finds an optimally separating hyperplane if 
one exists, else finds a hyperplane that minimizes some measure of overlap 
in the training data. We treat the separable case here, and defer treatment 
of the nonseparable case to Chapter 12. 

While this entire chapter is devoted to linear decision boundaries, there is
considerable scope for generalization. For example, we can expand our
variable set $X_1, \ldots, X_p$ by including their squares and cross-products
$X_1^2, X_2^2, \ldots, X_1 X_2, \ldots$, thereby adding $p(p + 1)/2$ additional
variables. Linear functions in the augmented space map down to quadratic
functions in the original space, and hence linear decision boundaries map to
quadratic decision boundaries. Figure 4.1 illustrates the idea. The data are
the same: the left plot uses linear decision boundaries in the two-dimensional
space shown, while the right plot uses linear decision boundaries in the
augmented five-dimensional space described above. This approach can be used
with any basis transformation $h(X)$ where
$h : \mathbb{R}^p \mapsto \mathbb{R}^q$ with $q > p$, and will be explored in
later chapters.

FIGURE 4.1. The left plot shows some data from three classes, with linear
decision boundaries found by linear discriminant analysis. The right plot shows
quadratic decision boundaries. These were obtained by finding linear boundaries
in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$. Linear
inequalities in this space are quadratic inequalities in the original space.


4.2 Linear Regression of an Indicator Matrix 

Here each of the response categories is coded via an indicator variable.
Thus if $\mathcal{G}$ has $K$ classes, there will be $K$ such indicators $Y_k$,
$k = 1, \ldots, K$, with $Y_k = 1$ if $G = k$ else 0. These are collected
together in a vector $Y = (Y_1, \ldots, Y_K)$, and the $N$ training instances of
these form an $N \times K$ indicator response matrix $\mathbf{Y}$. $\mathbf{Y}$ is
a matrix of 0’s and 1’s, with each row having a single 1. We fit a linear
regression model to each of the columns of $\mathbf{Y}$ simultaneously, and the
fit is given by

$$\hat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}. \tag{4.3}$$

Chapter 3 has more details on linear regression. Note that we have a
coefficient vector for each response column $\mathbf{y}_k$, and hence a
$(p + 1) \times K$ coefficient matrix
$\hat{\mathbf{B}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$. Here
$\mathbf{X}$ is the model matrix with $p + 1$ columns corresponding to the $p$
inputs, and a leading column of 1’s for the intercept.
A new observation with input $x$ is classified as follows:

• compute the fitted output $\hat f(x)^T = (1, x^T) \hat{\mathbf{B}}$, a $K$ vector;

• identify the largest component and classify accordingly:

$$\hat G(x) = \operatorname{argmax}_{k \in \mathcal{G}} \hat f_k(x). \tag{4.4}$$
What is the rationale for this approach? One rather formal justification
is to view the regression as an estimate of conditional expectation. For the
random variable $Y_k$, $E(Y_k | X = x) = \Pr(G = k | X = x)$, so conditional
expectation of each of the $Y_k$ seems a sensible goal. The real issue is: how
good an approximation to conditional expectation is the rather rigid linear
regression model? Alternatively, are the $\hat f_k(x)$ reasonable estimates of
the posterior probabilities $\Pr(G = k | X = x)$, and more importantly, does
this matter?

It is quite straightforward to verify that
$\sum_{k \in \mathcal{G}} \hat f_k(x) = 1$ for any $x$, as long as there is an
intercept in the model (column of 1’s in $\mathbf{X}$). However, the
$\hat f_k(x)$ can be negative or greater than 1, and typically some are. This
is a consequence of the rigid nature of linear regression, especially if we
make predictions outside the hull of the training data. These violations in
themselves do not guarantee that this approach will not work, and in fact
on many problems it gives similar results to more standard linear methods
for classification. If we allow linear regression onto basis expansions
$h(X)$ of the inputs, this approach can lead to consistent estimates of the
probabilities. As the size of the training set $N$ grows bigger, we adaptively
include more basis elements so that linear regression onto these basis
functions approaches conditional expectation. We discuss such approaches in
Chapter 5.

A more simplistic viewpoint is to construct targets $t_k$ for each class,
where $t_k$ is the $k$th column of the $K \times K$ identity matrix. Our
prediction problem is to try and reproduce the appropriate target for an
observation. With the same coding as before, the response vector $y_i$ ($i$th
row of $\mathbf{Y}$) for observation $i$ has the value $y_i = t_k$ if
$g_i = k$. We might then fit the linear model by least squares:

$$\min_{\mathbf{B}} \sum_{i=1}^{N} \| y_i - [(1, x_i^T) \mathbf{B}]^T \|^2. \tag{4.5}$$

The criterion is a sum-of-squared Euclidean distances of the fitted vectors
from their targets. A new observation is classified by computing its fitted
vector $\hat f(x)$ and classifying to the closest target:

$$\hat G(x) = \operatorname{argmin}_k \| \hat f(x) - t_k \|^2. \tag{4.6}$$

This is exactly the same as the previous approach: 

• The sum-of-squared-norm criterion is exactly the criterion for multiple 
response linear regression, just viewed slightly differently. Since 
a squared norm is itself a sum of squares, the components decouple 
and can be rearranged as a separate linear model for each element. 
Note that this is only possible because there is nothing in the model 
that binds the different responses together. 

• The closest target classification rule (4.6) is easily seen to be exactly 
the same as the maximum fitted component criterion (4.4). 

FIGURE 4.2. [Left panel: Linear Regression. Right panel: Linear Discriminant 
Analysis.] The data come from three classes in $\mathbb{R}^2$ and are easily separated 
by linear decision boundaries. The right plot shows the boundaries found by linear 
discriminant analysis. The left plot shows the boundaries found by linear 
regression of the indicator response variables. The middle class is completely 
masked (never dominates). 
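
The equivalence of the closest-target rule (4.6) and the largest-component rule (4.4) can be spelled out in one line. Since $t_k$ is the $k$th column of the identity matrix, $\|t_k\|^2 = 1$ and $\hat{f}(x)^T t_k = \hat{f}_k(x)$, so

$$\left\| \hat{f}(x) - t_k \right\|^2 = \left\| \hat{f}(x) \right\|^2 - 2\hat{f}_k(x) + 1.$$

Only the middle term depends on $k$, so minimizing the distance over $k$ is the same as maximizing $\hat{f}_k(x)$.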

There is a serious problem with the regression approach when the number 
of classes $K \ge 3$, especially prevalent when $K$ is large. Because of the rigid 
nature of the regression model, classes can be masked by others. Figure 4.2 
illustrates an extreme situation when $K = 3$. The three classes are perfectly 
separated by linear decision boundaries, yet linear regression misses the 
middle class completely. 

In Figure 4.3 we have projected the data onto the line joining the three 
centroids (there is no information in the orthogonal direction in this case), 
and we have included and coded the three response variables $Y_1$, $Y_2$ and 
$Y_3$. The three regression lines (left panel) are included, and we see that 
the line corresponding to the middle class is horizontal and its fitted values 
are never dominant! Thus, observations from class 2 are classified either 
as class 1 or class 3. The right panel uses quadratic regression rather than 
linear regression. For this simple example a quadratic rather than linear 
fit (for the middle class at least) would solve the problem. However, it 
can be seen that if there were four rather than three classes lined up like 
this, a quadratic would not come down fast enough, and a cubic would 
be needed as well. A loose but general rule is that if $K \ge 3$ classes are 
lined up, polynomial terms up to degree $K - 1$ might be needed to resolve 
them. Note also that these are polynomials along the derived direction 
passing through the centroids, which can have arbitrary orientation. So in 
$p$-dimensional input space, one would need general polynomial terms and 
cross-products of total degree $K - 1$, $O(p^{K-1})$ terms in all, to resolve such 
worst-case scenarios. 

FIGURE 4.3. [Left panel title: Degree = 1; Error = 0.33.] The effects of masking 
on linear regression in $\mathbb{R}$ for a three-class problem. The rug plot at the base 
indicates the positions and class membership of each observation. The three 
curves in each panel are the fitted regressions to the three-class indicator 
variables; for example, for the blue class, $y_{\text{blue}}$ is 1 for the blue 
observations, and 0 for the green and orange. The fits are linear and quadratic 
polynomials. Above each plot is the training error rate. The Bayes error rate is 
0.025 for this problem, as is the LDA error rate. 

The example is extreme, but for large K and small p such maskings 
naturally occur. As a more realistic illustration, Figure 4.4 is a projection 
of the training data for a vowel recognition problem onto an informative 
two-dimensional subspace. There are K = 11 classes in p = 10 dimensions. 
This is a difficult classification problem, and the best methods achieve 
around 40% errors on the test data. The main point here is summarized in 
Table 4.1; linear regression has an error rate of 67%, while a close relative, 
linear discriminant analysis, has an error rate of 56%. It seems that masking 
has hurt in this case. While all the other methods in this chapter are based 
on linear functions of x as well, they use them in such a way that avoids 
this masking problem. 
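
The three-class masking effect described above can be reproduced in a few lines. This is a rough sketch under stated assumptions: NumPy, one input variable, and a hypothetical, perfectly symmetric layout of three classes on a line (centers at -4, 0 and 4); it is not the data behind Figure 4.3.

```python
import numpy as np

def indicator_fit_predict(x, g, K, degree):
    """Regress the class indicators on 1, x, ..., x^degree; classify by argmax."""
    H = np.vander(x, degree + 1, increasing=True)   # polynomial basis in x
    Y = np.eye(K)[g]                                # N x K indicator matrix
    B_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return np.argmax(H @ B_hat, axis=1)

# Hypothetical layout: three classes lined up, 20 points each, no point at x = 0.
offsets = np.linspace(-1.0, 1.0, 20)
centers = np.array([-4.0, 0.0, 4.0])
x = (centers[:, None] + offsets[None, :]).ravel()
g = np.repeat(np.arange(3), offsets.size)

pred_linear = indicator_fit_predict(x, g, K=3, degree=1)
pred_quadratic = indicator_fit_predict(x, g, K=3, degree=2)

err_linear = np.mean(pred_linear != g)
err_quadratic = np.mean(pred_quadratic != g)
middle_found_linear = bool(np.any(pred_linear == 1))        # is class 2 ever predicted?
middle_found_quadratic = bool(np.any(pred_quadratic == 1))
```

With the linear fit the middle class (index 1) is never the largest component, so every one of its observations is misclassified; adding the quadratic term lets its fitted curve dominate near the center.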


4.3 Linear Discriminant Analysis 

Decision theory for classification (Section 2.4) tells us that we need to know 
the class posteriors $\Pr(G \mid X)$ for optimal classification. Suppose $f_k(x)$ is 
the class-conditional density of $X$ in class $G = k$, and let $\pi_k$ be the prior 
probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$. A simple application of Bayes 
theorem gives us 

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\pi_\ell}. \tag{4.7}$$

FIGURE 4.4. [Plot title: Linear Discriminant Analysis.] A two-dimensional plot 
of the vowel training data. There are eleven classes with $X \in \mathbb{R}^{10}$, and this 
is the best view in terms of a LDA model (Section 4.3.3). The heavy circles are 
the projected mean vectors for each class. The class overlap is considerable. 

TABLE 4.1. Training and test error rates using a variety of linear techniques 
on the vowel data. There are eleven classes in ten dimensions, of which three 
account for 90% of the variance (via a principal components analysis). We see 
that linear regression is hurt by masking, increasing the test and training error 
by over 10%. 

                                      Error Rates 
Technique                            Training    Test 
Linear regression                    0.48        0.67 
Linear discriminant analysis         0.32        0.56 
Quadratic discriminant analysis      0.01        0.53 
Logistic regression                  0.22        0.51 

We see that in terms of ability to classify, having the $f_k(x)$ is almost 
equivalent to having the quantity $\Pr(G = k \mid X = x)$. 

Many techniques are based on models for the class densities: 


• linear and quadratic discriminant analysis use Gaussian densities; 

• more flexible mixtures of Gaussians allow for nonlinear decision 
boundaries (Section 6.8); 

• general nonparametric density estimates for each class density allow 
the most flexibility (Section 6.6.2); 

• Naive Bayes models are a variant of the previous case, and assume 
that each of the class densities are products of marginal densities; 
that is, they assume that the inputs are conditionally independent in 
each class (Section 6.6.3). 

Suppose that we model each class density as multivariate Gaussian 

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}. \tag{4.8}$$

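Linear discriminant analysis (LDA) arises in the special case when we 
assume that the classes have a common covariance matrix $\Sigma_k = \Sigma \ \forall k$. In 
comparing two classes $k$ and $\ell$, it is sufficient to look at the log-ratio, and 
we see that 

$$\begin{aligned}
\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)}
&= \log \frac{f_k(x)}{f_\ell(x)} + \log \frac{\pi_k}{\pi_\ell} \\
&= \log \frac{\pi_k}{\pi_\ell} - \frac{1}{2}(\mu_k + \mu_\ell)^T \Sigma^{-1} (\mu_k - \mu_\ell)
   + x^T \Sigma^{-1} (\mu_k - \mu_\ell),
\end{aligned} \tag{4.9}$$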

an equation linear in $x$. The equal covariance matrices cause the 
normalization factors to cancel, as well as the quadratic part in the exponents. 
This linear log-odds function implies that the decision boundary between 
classes $k$ and $\ell$ (the set where $\Pr(G = k \mid X = x) = \Pr(G = \ell \mid X = x)$) is 
linear in $x$; in $p$ dimensions a hyperplane. This is of course true for any pair 
of classes, so all the decision boundaries are linear. If we divide $\mathbb{R}^p$ into 
regions that are classified as class 1, class 2, etc., these regions will be 
separated by hyperplanes. Figure 4.5 (left panel) shows an idealized example 
with three classes and $p = 2$. Here the data do arise from three Gaussian 
distributions with a common covariance matrix. We have included in 
the figure the contours corresponding to 95% highest probability density, 
as well as the class centroids. Notice that the decision boundaries are not 
the perpendicular bisectors of the line segments joining the centroids. This 
would be the case if the covariance $\Sigma$ were spherical $\sigma^2\mathbf{I}$, and the class 
priors were equal. From (4.9) we see that the linear discriminant functions 

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \tag{4.10}$$

are an equivalent description of the decision rule, with $G(x) = \operatorname{argmax}_k \delta_k(x)$. 

FIGURE 4.5. The left panel shows three Gaussian distributions, with the same 
covariance and different means. Included are the contours of constant density 
enclosing 95% of the probability in each case. The Bayes decision boundaries 
between each pair of classes are shown (broken straight lines), and the Bayes 
decision boundaries separating all three classes are the thicker solid lines (a subset 
of the former). On the right we see a sample of 30 drawn from each Gaussian 
distribution, and the fitted LDA decision boundaries. 

In practice we do not know the parameters of the Gaussian distributions, 
and will need to estimate them using our training data: 

• $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-$k$ observations; 

• $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$; 

• $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$. 

Figure 4.5 (right panel) shows the estimated decision boundaries based on 
a sample of size 30 each from three Gaussian distributions. Figure 4.1 on 
page 103 is another example, but here the classes are not Gaussian. 
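
The plug-in estimates and the discriminant functions (4.10) translate directly into code. The sketch below is illustrative only: it assumes NumPy, and the simulated three-class sample (30 observations per class with a shared covariance) uses made-up means and covariance rather than those behind Figure 4.5.

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in estimates: class proportions, centroids, pooled covariance."""
    N = X.shape[0]
    pi_hat = np.bincount(g, minlength=K) / N
    mu_hat = np.vstack([X[g == k].mean(axis=0) for k in range(K)])
    R = X - mu_hat[g]                              # within-class residuals
    Sigma_hat = R.T @ R / (N - K)
    return pi_hat, mu_hat, Sigma_hat

def lda_discriminants(X, pi_hat, mu_hat, Sigma_hat):
    """delta_k(x) of (4.10) for every row of X and every class k."""
    A = np.linalg.solve(Sigma_hat, mu_hat.T)       # columns are Sigma^{-1} mu_k
    const = -0.5 * np.sum(mu_hat.T * A, axis=0) + np.log(pi_hat)
    return X @ A + const

# Hypothetical sample: 30 points from each of three Gaussians, common covariance.
rng = np.random.default_rng(1)
K, n_per_class = 3, 30
mu_true = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
Sigma_true = np.array([[1.0, 0.5], [0.5, 1.0]])
g = np.repeat(np.arange(K), n_per_class)
noise = rng.multivariate_normal(np.zeros(2), Sigma_true, size=K * n_per_class)
X = mu_true[g] + noise

pi_hat, mu_hat, Sigma_hat = lda_fit(X, g, K)
delta = lda_discriminants(X, pi_hat, mu_hat, Sigma_hat)
G_hat = np.argmax(delta, axis=1)                   # classify to the largest delta_k
train_error = np.mean(G_hat != g)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the pooled covariance explicitly, which is a design choice of this sketch rather than something the text prescribes.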

With two classes there is a simple correspondence between linear 
discriminant analysis and classification by linear regression, as in (4.5). The 
LDA rule classifies to class 2 if 

$$x^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2}(\hat{\mu}_2 + \hat{\mu}_1)^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) - \log(N_2/N_1), \tag{4.11}$$

and class 1 otherwise. Suppose we code the targets in the two classes as +1 
and -1, respectively. It is easy to show that the coefficient vector from least 
squares is proportional to the LDA direction given in (4.11) (Exercise 4.2). 
[In fact, this correspondence occurs for any (distinct) coding of the targets; 
see Exercise 4.2]. However unless $N_1 = N_2$ the intercepts are different and 
hence the resulting decision rules are different. 
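
The proportionality between the least squares coefficients and the LDA direction can be checked numerically. This is a small sketch assuming NumPy and simulated two-class data with deliberately unequal class sizes; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, p = 40, 70, 3                        # unequal class sizes on purpose
X = np.vstack([
    rng.normal(size=(N1, p)),                                  # class 1
    rng.normal(size=(N2, p)) + np.array([1.0, 0.5, -0.5]),     # class 2
])
is_class1 = np.arange(N1 + N2) < N1
y = np.where(is_class1, 1.0, -1.0)           # targets +1 (class 1) and -1 (class 2)

# Least squares with an intercept; keep only the slope coefficients.
X1 = np.hstack([np.ones((N1 + N2, 1)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta_ls = coef[1:]

# LDA direction Sigma_hat^{-1} (mu_2 - mu_1) with the pooled covariance.
mu1, mu2 = X[is_class1].mean(axis=0), X[~is_class1].mean(axis=0)
R = np.vstack([X[is_class1] - mu1, X[~is_class1] - mu2])
Sigma_hat = R.T @ R / (N1 + N2 - 2)
lda_direction = np.linalg.solve(Sigma_hat, mu2 - mu1)

# |cosine| is 1 up to rounding; the sign only reflects which class is coded +1.
cosine = beta_ls @ lda_direction / (
    np.linalg.norm(beta_ls) * np.linalg.norm(lda_direction)
)
```

Only the directions agree; as noted in the text, the intercepts (and hence the cut-points) differ when the class sizes are unequal.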

Since this derivation of the LDA direction via least squares does not use a 
Gaussian assumption for the features, its applicability extends beyond the 
realm of Gaussian data. However the derivation of the particular intercept 
or cut-point given in (4.11) does require Gaussian data. Thus it makes 
sense to instead choose the cut-point that empirically minimizes training 
error for a given dataset. This is something we have found to work well in 
practice, but have not seen it mentioned in the literature. 

With more than two classes, LDA is not the same as linear regression of 
the class indicator matrix, and it avoids the masking problems associated 
with that approach (Hastie et al., 1994). A correspondence between 
regression and LDA can be established through the notion of optimal scoring, 
discussed in Section 12.5. 

Getting back to the general discriminant problem (4.8), if the $\Sigma_k$ are 
not assumed to be equal, then the convenient cancellations in (4.9) do not 
occur; in particular the pieces quadratic in $x$ remain. We then get quadratic 
discriminant functions (QDA), 

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k. \tag{4.12}$$

The decision boundary between each pair of classes $k$ and $\ell$ is described by 
a quadratic equation $\{x : \delta_k(x) = \delta_\ell(x)\}$. 
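
A plug-in version of the quadratic discriminant functions (4.12) might look as follows. This is a sketch under assumptions: NumPy, per-class sample covariances from `np.cov` (divisor $N_k - 1$), and a hypothetical two-class example in which both classes share a center and differ only in spread, so that no linear boundary can separate them.

```python
import numpy as np

def qda_fit(X, g, K):
    """Per-class proportion, centroid and covariance estimate."""
    N = X.shape[0]
    params = []
    for k in range(K):
        Xk = X[g == k]
        params.append((Xk.shape[0] / N, Xk.mean(axis=0), np.cov(Xk, rowvar=False)))
    return params

def qda_discriminants(X, params):
    """delta_k(x) of (4.12) for every row of X and every class k."""
    columns = []
    for prior, mu, Sigma in params:
        D = X - mu
        maha = np.sum(D * np.linalg.solve(Sigma, D.T).T, axis=1)  # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(Sigma)                      # log |Sigma_k|
        columns.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.column_stack(columns)

# Hypothetical data: same center, very different spreads.
rng = np.random.default_rng(3)
n = 100
X = np.vstack([0.5 * rng.normal(size=(n, 2)), 3.0 * rng.normal(size=(n, 2))])
g = np.repeat([0, 1], n)

params = qda_fit(X, g, K=2)
G_hat = np.argmax(qda_discriminants(X, params), axis=1)
accuracy = np.mean(G_hat == g)
```

Here the fitted boundary is roughly a circle around the common center, which is exactly the kind of quadratic set $\{x : \delta_k(x) = \delta_\ell(x)\}$ described above.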

Figure 4.6 shows an example (from Figure 4.1 on page 103) where the 
three classes are Gaussian mixtures (Section 6.8) and the decision 
boundaries are approximated by quadratic equations in $x$. Here we illustrate 
two popular ways of fitting these quadratic boundaries. The right plot 
uses QDA as described here, while the left plot uses LDA in the enlarged 
five-dimensional quadratic polynomial space. The differences are generally 
small; QDA is the preferred approach, with the LDA method a convenient 
substitute.[2] 

[2] For this figure and many similar figures in the book we compute the decision 
boundaries by an exhaustive contouring method. We compute the decision rule on a 
fine lattice of points, and then use contouring algorithms to compute the boundaries. 

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows 
the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA 
in the five-dimensional space $X_1, X_2, X_1X_2, X_1^2, X_2^2$). The right plot shows the 
quadratic decision boundaries found by QDA. The differences are small, as is 
usually the case. 

The estimates for QDA are similar to those for LDA, except that separate 
covariance matrices must be estimated for each class. When $p$ is large this 
can mean a dramatic increase in parameters. Since the decision boundaries 
are functions of the parameters of the densities, counting the number of 
parameters must be done with care. For LDA, it seems there are 
$(K - 1) \times (p + 1)$ parameters, since we only need the differences 
$\delta_k(x) - \delta_K(x)$ between the discriminant functions where $K$ is some 
pre-chosen class (here we have chosen the last), and each difference requires 
$p + 1$ parameters.[3] Likewise for QDA there will be 
$(K - 1) \times \{p(p + 3)/2 + 1\}$ parameters. 
Both LDA and QDA perform well on an amazingly large and diverse set 
of classification tasks. For example, in the STATLOG project (Michie et 
al., 1994) LDA was among the top three classifiers for 7 of the 22 datasets, 
QDA among the top three for four datasets, and one of the pair was in the 
top three for 10 datasets. Both techniques are widely used, and entire books 
are devoted to LDA. It seems that whatever exotic tools are the rage of the 
day, we should always have available these two simple tools. The question 
arises why LDA and QDA have such a good track record. The reason is not 
likely to be that the data are approximately Gaussian, and in addition for 
LDA that the covariances are approximately equal. More likely a reason is 
that the data can only support simple decision boundaries such as linear or 
quadratic, and the estimates provided via the Gaussian models are stable. 
This is a bias-variance tradeoff: we can put up with the bias of a linear 
decision boundary because it can be estimated with much lower variance 
than more exotic alternatives. This argument is less believable for QDA, 
since it can have many parameters itself, although perhaps fewer than the 
non-parametric alternatives. 
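
To get a feel for the two parameter counts, they can be evaluated for the vowel data dimensions quoted earlier ($K = 11$ classes, $p = 10$ inputs). The helper names below are hypothetical; the formulas are the ones stated in the paragraph above.

```python
def lda_boundary_params(K, p):
    """(K - 1) differences of discriminants, each with p + 1 parameters."""
    return (K - 1) * (p + 1)

def qda_boundary_params(K, p):
    """(K - 1) differences, each with p(p + 3)/2 + 1 parameters."""
    return (K - 1) * (p * (p + 3) // 2 + 1)   # p(p + 3) is always even

K, p = 11, 10                                 # vowel data: 11 classes, 10 inputs
n_lda = lda_boundary_params(K, p)             # 10 * 11 = 110
n_qda = qda_boundary_params(K, p)             # 10 * 66 = 660
```

For these dimensions QDA needs six times as many boundary parameters as LDA, which illustrates why its variance can be a concern when $p$ is large.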


[3] Although we fit the covariance matrix $\hat{\Sigma}$ to compute the LDA discriminant 
functions, a much reduced function of it is all that is required to estimate the 
$O(p)$ parameters needed to compute the decision boundaries. 

FIGURE 4.7. [Plot title: Regularized Discriminant Analysis on the Vowel Data.] 
Test and training errors for the vowel data, using regularized 
discriminant analysis with a series of values of $\alpha \in [0, 1]$. The optimum for the 
test data occurs around $\alpha = 0.9$, close to quadratic discriminant analysis. 


4.3.1 Regularized Discriminant Analysis 

Friedman (1989) proposed a compromise between LDA and QDA, which 
allows one to shrink the separate covariances of QDA toward a common 
covariance as in LDA. These methods are very similar in flavor to ridge 
regression. The regularized covariance matrices have the form 

$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}, \tag{4.13}$$

where $\hat{\Sigma}$ is the pooled covariance matrix as used in LDA. Here $\alpha \in [0, 1]$ 
allows a continuum of models between LDA and QDA, and needs to be 
specified. In practice $\alpha$ can be chosen based on the performance of the 
model on validation data, or by cross-validation. 

Figure 4.7 shows the results of RDA applied to the vowel data. Both 
the training and test error improve with increasing $\alpha$, although the test 
error increases sharply after $\alpha = 0.9$. The large discrepancy between the 
training and test error is partly due to the fact that there are many repeat 
measurements on a small number of individuals, different in the training 
and test set. 

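Similar modifications allow $\hat{\Sigma}$ itself to be shrunk toward the scalar 
covariance, 

$$\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2\mathbf{I} \tag{4.14}$$

for $\gamma \in [0, 1]$. Replacing $\hat{\Sigma}$ in (4.13) by $\hat{\Sigma}(\gamma)$ leads to a more general family 
of covariances $\hat{\Sigma}(\alpha, \gamma)$ indexed by a pair of parameters. 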
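
The two-parameter family $\hat{\Sigma}(\alpha, \gamma)$ is a pair of convex combinations, sketched below with NumPy. One assumption is flagged loudly: the text does not say how $\hat{\sigma}^2$ is chosen, so the sketch uses the average diagonal element of the pooled covariance as a stand-in. The function name and the simulated data are likewise hypothetical.

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma, alpha, gamma=1.0, sigma2=None):
    """Sigma_k(alpha) of (4.13), with Sigma optionally replaced by Sigma(gamma) of (4.14)."""
    p = Sigma.shape[0]
    if sigma2 is None:
        sigma2 = np.trace(Sigma) / p          # ASSUMPTION: not specified in the text
    Sigma_gamma = gamma * Sigma + (1.0 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_k + (1.0 - alpha) * Sigma_gamma

# Hypothetical setting: a class with fewer observations than dimensions.
rng = np.random.default_rng(4)
p = 5
Sigma_pooled = np.cov(rng.normal(size=(50, p)), rowvar=False)   # well conditioned
Sigma_small = np.cov(rng.normal(size=(3, p)), rowvar=False)     # rank deficient

Sigma_reg = rda_covariance(Sigma_small, Sigma_pooled, alpha=0.5)
rank_small = np.linalg.matrix_rank(Sigma_small)
rank_reg = np.linalg.matrix_rank(Sigma_reg)
```

Setting `alpha=1` recovers the QDA covariance and `alpha=0, gamma=1` the LDA covariance. The example also shows a practical benefit of the shrinkage: a singular class covariance becomes invertible once it is mixed with the pooled estimate.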

In Chapter 12, we discuss other regularized versions of LDA, which are 
more suitable when the data arise from digitized analog signals and images. 
In these situations the features are high-dimensional and correlated, and the 
LDA coefficients can be regularized to be smooth or sparse in the original 
domain of the signal. This leads to better generalization and allows for 
easier interpretation of the coefficients. In Chapter 18 we also deal with 
very high-dimensional problems, where for example the features are 
gene-expression measurements in microarray studies. There the methods focus 
on the case $\gamma = 0$ in (4.14), and other severely regularized versions of LDA. 

4.3.2 Computations for LDA 

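As a lead-in to the next topic, we briefly digress on the computations 
required for LDA and especially QDA. Their computations are simplified 
by diagonalizing $\hat{\Sigma}$ or $\hat{\Sigma}_k$. For the latter, suppose we compute the 
eigen-decomposition for each $\hat{\Sigma}_k = \mathbf{U}_k \mathbf{D}_k \mathbf{U}_k^T$, where $\mathbf{U}_k$ is $p \times p$ orthonormal, 
and $\mathbf{D}_k$ a diagonal matrix of positive eigenvalues $d_{k\ell}$. Then the ingredients 
for $\delta_k(x)$ (4.12) are 

• $(x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) = [\mathbf{U}_k^T (x - \hat{\mu}_k)]^T \mathbf{D}_k^{-1} [\mathbf{U}_k^T (x - \hat{\mu}_k)]$; 

• $\log|\hat{\Sigma}_k| = \sum_{\ell} \log d_{k\ell}$. 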

In light of the computational steps outlined above, the LDA classifier 
can be implemented by the following pair of steps: 

• Sphere the data with respect to the common covariance estimate $\hat{\Sigma}$: 
$X^* \leftarrow \mathbf{D}^{-1/2}\mathbf{U}^T X$, where $\hat{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. The common covariance 
estimate of $X^*$ will now be the identity. 

• Classify to the closest class centroid in the transformed space, modulo 
the effect of the class prior probabilities $\pi_k$. 
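
The two steps above can be sketched as follows, again assuming NumPy and simulated data with made-up class sizes. The "modulo the priors" correction is implemented by subtracting $\log \hat{\pi}_k$ from half the squared distance, which follows from (4.10) once the covariance is the identity; the result is compared with classification by the discriminant functions computed directly.

```python
import numpy as np

def sphering_matrix(Sigma):
    """Return U D^{-1/2}, so that rows transform as x* = D^{-1/2} U^T x."""
    d, U = np.linalg.eigh(Sigma)           # Sigma = U diag(d) U^T
    return U / np.sqrt(d)                  # scales column j of U by d_j^{-1/2}

# Hypothetical data: three classes of unequal size with a shared covariance.
rng = np.random.default_rng(5)
sizes, p = [20, 40, 60], 4
K, N = len(sizes), sum(sizes)
A = np.eye(p) + 0.3 * rng.normal(size=(p, p))
g = np.repeat(np.arange(K), sizes)
X = rng.normal(scale=2.0, size=(K, p))[g] + rng.normal(size=(N, p)) @ A.T

pi_hat = np.bincount(g, minlength=K) / N
mu_hat = np.vstack([X[g == k].mean(axis=0) for k in range(K)])
R = X - mu_hat[g]
Sigma_hat = R.T @ R / (N - K)

# Step 1: sphere the data and the centroids.
T = sphering_matrix(Sigma_hat)
X_star, mu_star = X @ T, mu_hat @ T
R_star = X_star - mu_star[g]
Sigma_star = R_star.T @ R_star / (N - K)   # pooled covariance after sphering

# Step 2: closest centroid, adjusted by the log prior.
d2 = ((X_star[:, None, :] - mu_star[None, :, :]) ** 2).sum(axis=2)
pred_sphered = np.argmin(0.5 * d2 - np.log(pi_hat), axis=1)

# Reference: the linear discriminant functions (4.10) in the original space.
B = np.linalg.solve(Sigma_hat, mu_hat.T)
delta = X @ B - 0.5 * np.sum(mu_hat.T * B, axis=0) + np.log(pi_hat)
pred_direct = np.argmax(delta, axis=1)
```

Both routes give the same class assignments, and the pooled covariance of the sphered data is the identity matrix up to rounding.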


4.3.3 Reduced-Rank Linear Discriminant Analysis 

So far we have discussed LDA as a restricted Gaussian classifier. Part of 
its popularity is due to an additional restriction that allows us to view 
informative low-dimensional projections of the data. 

The $K$ centroids in $p$-dimensional input space lie in an affine subspace 
of dimension $\le K - 1$, and if $p$ is much larger than $K$, this will be a 
considerable drop in dimension. Moreover, in locating the closest centroid, we 
can ignore distances orthogonal to this subspace, since they will contribute 
equally to each class. Thus we might just as well project the $X^*$ onto this 
centroid-spanning subspace $H_{K-1}$, and make distance comparisons there. 
Thus there is a fundamental dimension reduction in LDA, namely, that we 
need only consider the data in a subspace of dimension at most $K - 1$. 
If $K = 3$, for instance, this could allow us to view the data in a 
two-dimensional plot, color-coding the classes. In doing so we would not have 
relinquished any of the information needed for LDA classification. 

What if $K > 3$? We might then ask for a $L < K - 1$ dimensional subspace 
$H_L \subseteq H_{K-1}$ optimal for LDA in some sense. Fisher defined optimal to 
mean that the projected centroids were spread out as much as possible in 
terms of variance. This amounts to finding principal component subspaces 
of the centroids themselves (principal components are described briefly in 
Section 3.5.1, and in more detail in Section 14.5.1). Figure 4.4 shows such an 
optimal two-dimensional subspace for the vowel data. Here there are eleven 
classes, each a different vowel sound, in a ten-dimensional input space. The 
centroids require the full space in this case, since $K - 1 = p$, but we have 
shown an optimal two-dimensional subspace. The dimensions are ordered, 
so we can compute additional dimensions in sequence. Figure 4.8 shows four 
additional pairs of coordinates, also known as canonical or discriminant 
variables. In summary then, finding the sequences of optimal subspaces 
for LDA involves the following steps: 

• compute the $K \times p$ matrix of class centroids $\mathbf{M}$ and the common 
covariance matrix $\mathbf{W}$ (for within-class covariance); 

• compute $\mathbf{M}^* = \mathbf{M}\mathbf{W}^{-1/2}$ using the eigen-decomposition of $\mathbf{W}$; 

• compute $\mathbf{B}^*$, the covariance matrix of $\mathbf{M}^*$ ($\mathbf{B}$ for between-class 
covariance), and its eigen-decomposition $\mathbf{B}^* = \mathbf{V}^* \mathbf{D}_B \mathbf{V}^{*T}$. The columns 
$v_\ell^*$ of $\mathbf{V}^*$ in sequence from first to last define the coordinates of the 
optimal subspaces. 

Combining all these operations the $\ell$th discriminant variable is given by 
$Z_\ell = v_\ell^T X$ with $v_\ell = \mathbf{W}^{-1/2} v_\ell^*$. 
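
The steps above can be sketched as a short routine. The code assumes NumPy and simulated data; it also assumes an unweighted covariance of the sphered centroids (via `np.cov`), since the text does not say whether the centroids are weighted by class size. With equal class sizes, as in this hypothetical example, the two conventions differ only by a constant factor.

```python
import numpy as np

def discriminant_directions(X, g, K):
    """Return the directions v_l (columns), the eigenvalues of B*, and W."""
    N = X.shape[0]
    M = np.vstack([X[g == k].mean(axis=0) for k in range(K)])   # K x p centroids
    R = X - M[g]
    W = R.T @ R / (N - K)                                       # within-class covariance
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = (U / np.sqrt(d)) @ U.T                         # symmetric W^{-1/2}
    M_star = M @ W_inv_sqrt                                     # sphered centroids
    B_star = np.cov(M_star, rowvar=False)                       # ASSUMPTION: unweighted
    evals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(evals)[::-1]                             # largest spread first
    V = W_inv_sqrt @ V_star[:, order]                           # v_l = W^{-1/2} v_l*
    return V, evals[order], W

# Hypothetical data: K = 3 classes in p = 5 dimensions, 40 points each.
rng = np.random.default_rng(6)
K, n_per_class, p = 3, 40, 5
g = np.repeat(np.arange(K), n_per_class)
X = rng.normal(scale=2.0, size=(K, p))[g] + rng.normal(size=(K * n_per_class, p))

V, evals, W = discriminant_directions(X, g, K)
Z = X @ V[:, :2]                       # first two discriminant variables
```

Because three centroids span at most a two-dimensional affine subspace, only the first $K - 1 = 2$ eigenvalues are nonzero here, and the directions satisfy $v_\ell^T \mathbf{W} v_m = 0$ for $\ell \neq m$.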

Fisher arrived at this decomposition via a different route, without 
referring to Gaussian distributions at all. He posed the problem: 

Find the linear combination $Z = a^T X$ such that the between-class 
variance is maximized relative to the within-class variance. 

Again, the between class variance is the variance of the class means of 
$Z$, and the within class variance is the pooled variance about the means. 
Figure 4.9 shows why this criterion makes sense. Although the direction 
joining the centroids separates the means as much as possible (i.e., 
maximizes the between-class variance), there is considerable overlap between 
the projected classes due to the nature of the covariances. By taking the 
covariance into account as well, a direction with minimum overlap can be 
found. 

The between-class variance of $Z$ is $a^T \mathbf{B} a$ and the within-class variance 
$a^T \mathbf{W} a$, where $\mathbf{W}$ is defined earlier, and $\mathbf{B}$ is the covariance matrix of the 
class centroid matrix $\mathbf{M}$. Note that $\mathbf{B} + \mathbf{W} = \mathbf{T}$, where $\mathbf{T}$ is the total 
covariance matrix of $X$, ignoring class information. 


FIGURE 4.8. [Axis labels: Coordinate 1, Coordinate 3, Coordinate 7, Coordinate 9.] 
Four projections onto pairs of canonical variates. Notice that as 
the rank of the canonical variates increases, the centroids become less spread out. 
In the lower right panel they appear to be superimposed, and the classes most 
confused. 

FIGURE 4.9. Although the line joining the centroids defines the direction of 
greatest centroid spread, the projected data overlap because of the covariance 
(left panel). The discriminant direction minimizes this overlap for Gaussian data 
(right panel). 


Fisher's problem therefore amounts to maximizing the Rayleigh quotient, 

$$\max_{a} \frac{a^T \mathbf{B} a}{a^T \mathbf{W} a}, \tag{4.15}$$

or equivalently 

$$\max_{a} a^T \mathbf{B} a \ \text{ subject to } \ a^T \mathbf{W} a = 1. \tag{4.16}$$

This is a generalized eigenvalue problem, with $a$ given by the eigenvector 
corresponding to the largest eigenvalue of $\mathbf{W}^{-1}\mathbf{B}$. It is not hard to show 
(Exercise 4.1) that the optimal $a_1$ is identical to $v_1$ defined above. Similarly 
one can find the next direction $a_2$, orthogonal in $\mathbf{W}$ to $a_1$, such that 
$a_2^T \mathbf{B} a_2 / a_2^T \mathbf{W} a_2$ is maximized; the solution is $a_2 = v_2$, and so on. The 
$a_\ell$ are referred to as discriminant coordinates, not to be confused with 
discriminant functions. They are also 
referred to as canonical variates, since an alternative derivation of these 
results is through a canonical correlation analysis of the indicator response 
matrix $\mathbf{Y}$ on the predictor matrix $\mathbf{X}$. This line is pursued in Section 12.5. 
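
The link between the constrained problem (4.16) and the eigenvalue problem can be made explicit with a Lagrange multiplier. This is a standard argument, sketched here rather than quoted from the text. Form

$$L(a, \lambda) = a^T \mathbf{B} a - \lambda\,(a^T \mathbf{W} a - 1),$$

and set the gradient with respect to $a$ to zero:

$$2\mathbf{B}a - 2\lambda\mathbf{W}a = 0 \quad\Longleftrightarrow\quad \mathbf{W}^{-1}\mathbf{B}\,a = \lambda a.$$

Any stationary point is therefore an eigenvector of $\mathbf{W}^{-1}\mathbf{B}$, and at such a point $a^T \mathbf{B} a = \lambda\, a^T \mathbf{W} a = \lambda$. The objective is thus maximized by the eigenvector belonging to the largest eigenvalue.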

To summarize the developments so far: 


• Gaussian classification with common covariances leads to linear 
decision boundaries. Classification can be achieved by sphering the data 
with respect to $\mathbf{W}$, and classifying to the closest centroid (modulo 
$\log \pi_k$) in the sphered space. 

• Since only the relative distances to the centroids count, one can 
confine the data to the subspace spanned by the centroids in the sphered 
space. 

• This subspace can be further decomposed into successively optimal 
subspaces in terms of centroid separation. This decomposition is 
identical to the decomposition due to Fisher. 

FIGURE 4.10. [Plot title: LDA and Dimension Reduction on the Vowel Data.] 
Training and test error rates for the vowel data, as a function 
of the dimension of the discriminant subspace. In this case the best error rate is 
for dimension 2. Figure 4.11 shows the decision boundaries in this space. 

The reduced subspaces have been motivated as a data reduction (for 
viewing) tool. Can they also be used for classification, and what is the 
rationale? Clearly they can, as in our original derivation; we simply limit 
the distance-to-centroid calculations to the chosen subspace. One can show 
that this is a Gaussian classification rule with the additional restriction 
that the centroids of the Gaussians lie in an $L$-dimensional subspace of $\mathbb{R}^p$. 
Fitting such a model by maximum likelihood, and then constructing the 
posterior probabilities using Bayes' theorem amounts to the classification 
rule described above (Exercise 4.8). 

Gaussian classification dictates the $\log \pi_k$ correction factor in the 
distance calculation. The reason for this correction can be seen in Figure 4.9. 
The misclassification rate is based on the area of overlap between the two 
densities. If the $\pi_k$ are equal (implicit in that figure), then the optimal 
cut-point is midway between the projected means. If the $\pi_k$ are not equal, 
moving the cut-point toward the smaller class will improve the error rate. 
As mentioned earlier for two classes, one can derive the linear rule using 
LDA (or any other method), and then choose the cut-point to minimize 
misclassification error over the training data. 

As an example of the benefit of the reduced-rank restriction, we return 
to the vowel data. There are 11 classes and 10 variables, and hence 10 
possible dimensions for the classifier. We can compute the training and 
test error in each of these hierarchical subspaces; Figure 4.10 shows the 
results. Figure 4.11 shows the decision boundaries for the classifier based 
on the two-dimensional LDA solution. 

There is a close connection between Fisher's reduced rank discriminant 
analysis and regression of an indicator response matrix. It turns out that 
LDA amounts to the regression followed by an eigen-decomposition of 
$\hat{\mathbf{Y}}^T \mathbf{Y}$. In the case of two classes, there is a single discriminant variable 
that is identical up to a scalar multiplication to either of the columns of $\hat{\mathbf{Y}}$. 
These connections are developed in Chapter 12. A related fact is that if one 
transforms the original predictors $\mathbf{X}$ to $\hat{\mathbf{Y}}$, then LDA using $\hat{\mathbf{Y}}$ is identical 
to LDA in the original space (Exercise 4.3). 

FIGURE 4.11. [Plot title: Classification in Reduced Subspace. Axes: Canonical 
Coordinate 1, Canonical Coordinate 2.] Decision boundaries for the vowel training 
data, in the two-dimensional subspace spanned by the first two canonical variates. 
Note that in any higher-dimensional subspace, the decision boundaries are 
higher-dimensional affine planes, and could not be represented as lines. 


4.4 Logistic Regression 

The logistic regression model arises from the desire to model the posterior 
probabilities of the $K$ classes via linear functions in $x$, while at the same 
time ensuring that they sum to one and remain in $[0, 1]$. The model has 
the form 

$$\begin{aligned}
\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{10} + \beta_1^T x \\
\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{20} + \beta_2^T x \\
&\ \ \vdots \\
\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} &= \beta_{(K-1)0} + \beta_{K-1}^T x.
\end{aligned} \tag{4.17}$$

The model is specified in terms of $K - 1$ log-odds or logit transformations 
(reflecting the constraint that the probabilities sum to one). Although the 
model uses the last class as the denominator in the odds-ratios, the choice 
of denominator is arbitrary in that the estimates are equivariant under this 
choice. A simple calculation shows that 

$$\begin{aligned}
\Pr(G = k \mid X = x) &= \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)}, \quad k = 1, \ldots, K-1, \\
\Pr(G = K \mid X = x) &= \frac{1}{1 + \sum_{\ell=1}^{K-1} \exp(\beta_{\ell 0} + \beta_\ell^T x)},
\end{aligned} \tag{4.18}$$

and they clearly sum to one. To emphasize the dependence on the entire 
parameter set $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, we denote the probabilities 
$\Pr(G = k \mid X = x) = p_k(x; \theta)$. 
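
The probabilities in (4.18) can be computed by treating the reference class $K$ as having linear predictor zero. The sketch below assumes NumPy; the function name and the random parameter values are hypothetical, and subtracting the row maximum before exponentiating is a numerical safeguard added here, not part of the model.

```python
import numpy as np

def class_probabilities(X, beta0, beta):
    """N x K matrix of Pr(G = k | X = x) from (4.18).

    beta0 has shape (K - 1,), beta has shape (K - 1, p); class K is the reference.
    """
    eta = beta0 + X @ beta.T                                   # N x (K - 1) logits
    eta_full = np.hstack([eta, np.zeros((X.shape[0], 1))])     # reference class: 0
    eta_full = eta_full - eta_full.max(axis=1, keepdims=True)  # avoid overflow in exp
    expo = np.exp(eta_full)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical parameters for K = 4 classes and p = 3 inputs.
rng = np.random.default_rng(7)
K, p, N = 4, 3, 6
beta0 = rng.normal(size=K - 1)
beta = rng.normal(size=(K - 1, p))
X = rng.normal(size=(N, p))

P = class_probabilities(X, beta0, beta)
row_sums = P.sum(axis=1)
log_odds = np.log(P[:, :-1] / P[:, [-1]])      # recovers beta_k0 + beta_k^T x
```

Shifting all linear predictors in a row by the same constant leaves the ratios unchanged, so the recovered log-odds against class $K$ still match (4.17).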

When $K = 2$, this model is especially simple, since there is only a single 
linear function. It is widely used in biostatistical applications where binary 
responses (two classes) occur quite frequently. For example, patients survive 
or die, have heart disease or not, or a condition is present or absent. 

4.4.1 Fitting Logistic Regression Models 

Logistic regression models are usually fit by maximum likelihood, using the 
conditional likelihood of $G$ given $X$. Since $\Pr(G \mid X)$ completely specifies the 
conditional distribution, the multinomial distribution is appropriate. The 
log-likelihood for $N$ observations is 

$$\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta), \tag{4.19}$$

where $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$. 

We discuss in detail the two-class case, since the algorithms simplify 
considerably. It is convenient to code the two-class $g_i$ via a 0/1 response $y_i$, 
where $y_i = 1$ when $g_i = 1$, and $y_i = 0$ when $g_i = 2$. Let $p_1(x; \theta) = p(x; \theta)$, 
and $p_2(x; \theta) = 1 - p(x; \theta)$. The log-likelihood can be written 

$$\begin{aligned}
\ell(\beta) &= \sum_{i=1}^{N} \left\{ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right\} \\
&= \sum_{i=1}^{N} \left\{ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right\}.
\end{aligned} \tag{4.20}$$

Here $\beta = \{\beta_{10}, \beta_1\}$, and we assume that the vector of inputs $x_i$ includes 
the constant term 1 to accommodate the intercept. 

To maximize the log-likelihood, we set its derivatives to zero. These score 
equations are 

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) = 0, \tag{4.21}$$

which are $p + 1$ equations nonlinear in $\beta$. Notice that since the first 
component of $x_i$ is 1, the first score equation specifies that 
$\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} p(x_i; \beta)$; 
the expected number of class ones matches the observed number (and hence 
also class twos). 

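To solve the score equations (4.21), we use the Newton-Raphson 
algorithm, which requires the second-derivative or Hessian matrix 

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)). \tag{4.22}$$

Starting with $\beta^{\text{old}}$, a single Newton update is 

$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}, \tag{4.23}$$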


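where the derivatives are evaluated at $\beta^{\text{old}}$. 

It is convenient to write the score and Hessian in matrix notation. Let 
$\mathbf{y}$ denote the vector of $y_i$ values, $\mathbf{X}$ the $N \times (p + 1)$ matrix of $x_i$ values, 
$\mathbf{p}$ the vector of fitted probabilities with $i$th element $p(x_i; \beta^{\text{old}})$ and $\mathbf{W}$ a 
$N \times N$ diagonal matrix of weights with $i$th diagonal element 
$p(x_i; \beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$. Then we have 

$$\frac{\partial \ell(\beta)}{\partial \beta} = \mathbf{X}^T (\mathbf{y} - \mathbf{p}) \tag{4.24}$$

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X} \tag{4.25}$$

The Newton step is thus 

$$\begin{aligned}
\beta^{\text{new}} &= \beta^{\text{old}} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}) \\
&= (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \left( \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p}) \right) \\
&= (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}.
\end{aligned} \tag{4.26}$$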

In the second and third line we have re-expressed the Newton step as a 
weighted least squares step, with the response 

$$\mathbf{z} = \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p}), \tag{4.27}$$

sometimes known as the adjusted response. These equations get solved 
repeatedly, since at each iteration $\mathbf{p}$ changes, and hence so does $\mathbf{W}$ and $\mathbf{z}$. 
This algorithm is referred to as iteratively reweighted least squares or IRLS, 
since each iteration solves the weighted least squares problem: 

$$\beta^{\text{new}} \leftarrow \operatorname{argmin}_{\beta} \, (\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta). \tag{4.28}$$


It seems that $\beta = 0$ is a good starting value for the iterative procedure, 
although convergence is never guaranteed. Typically the algorithm does 
converge, since the log-likelihood is concave, but overshooting can occur. 
In the rare cases that the log-likelihood decreases, step size halving will 
guarantee convergence. 
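
The IRLS iteration (4.26) to (4.28) for the two-class case fits in a few lines. What follows is a bare-bones sketch assuming NumPy and simulated data with hypothetical true coefficients. It starts from $\beta = 0$, omits step size halving, and has no safeguard against weights that underflow to zero, so it should only be expected to work when the classes overlap.

```python
import numpy as np

def irls_logistic(X, y, max_iter=25, tol=1e-10):
    """Two-class logistic regression by iteratively reweighted least squares."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # constant term for the intercept
    beta = np.zeros(X1.shape[1])                      # starting value beta = 0
    for _ in range(max_iter):
        eta = X1 @ beta
        p = 1.0 / (1.0 + np.exp(-eta))                # fitted probabilities
        w = p * (1.0 - p)                             # diagonal of W
        z = eta + (y - p) / w                         # adjusted response (4.27)
        XtW = X1.T * w                                # X^T W without forming W
        beta_new = np.linalg.solve(XtW @ X1, XtW @ z) # weighted least squares (4.28)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical data drawn from a logistic model.
rng = np.random.default_rng(8)
N = 500
X = rng.normal(size=(N, 2))
beta_true = np.array([0.5, 1.0, -1.0])                # (intercept, slopes), made up
p_true = 1.0 / (1.0 + np.exp(-(beta_true[0] + X @ beta_true[1:])))
y = (rng.uniform(size=N) < p_true).astype(float)

beta_hat = irls_logistic(X, y)
X1 = np.hstack([np.ones((N, 1)), X])
p_hat = 1.0 / (1.0 + np.exp(-(X1 @ beta_hat)))
score = X1.T @ (y - p_hat)                            # score equations (4.21)
```

At convergence the score vector vanishes up to rounding, and its first component confirms that the fitted probabilities sum to the observed number of class ones.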

For the multiclass case ($K \ge 3$) the Newton algorithm can also be
expressed as an iteratively reweighted least squares algorithm, but with a
vector of $K - 1$ responses and a nondiagonal weight matrix per observation.
The latter precludes any simplified algorithms, and in this case it is
numerically more convenient to work with the expanded vector $\theta$ directly
(Exercise 4.4). Alternatively coordinate-descent methods (Section 3.8.6) can
be used to maximize the log-likelihood efficiently. The R package glmnet
(Friedman et al., 2010) can fit very large logistic regression problems
efficiently, both in $N$ and $p$. Although designed to fit regularized models,
options allow for unregularized fits.

Logistic regression models are used mostly as a data analysis and inference
tool, where the goal is to understand the role of the input variables
in explaining the outcome. Typically many models are fit in a search for a
parsimonious model involving a subset of the variables, possibly with some
interaction terms. The following example illustrates some of the issues
involved.

TABLE 4.2. Results from a logistic regression fit to the South African heart
disease data.

               Coefficient   Std. Error   Z Score
(Intercept)         -4.130        0.964    -4.285
sbp                  0.006        0.006     1.023
tobacco              0.080        0.026     3.034
ldl                  0.185        0.057     3.219
famhist              0.939        0.225     4.178
obesity             -0.035        0.029    -1.187
alcohol              0.001        0.004     0.136
age                  0.043        0.010     4.184


4.4.2 Example: South African Heart Disease

Here we present an analysis of binary data to illustrate the traditional 
statistical use of the logistic regression model. The data in Figure 4.12 are a 
subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried 
out in three rural areas of the Western Cape, South Africa (Rousseauw et 
al., 1983). The aim of the study was to establish the intensity of ischemic 
heart disease risk factors in that high-incidence region. The data represent 
white males between 15 and 64, and the response variable is the presence or 
absence of myocardial infarction (MI) at the time of the survey (the overall 
prevalence of MI was 5.1% in this region). There are 160 cases in our data 
set, and a sample of 302 controls. These data are described in more detail 
in Hastie and Tibshirani (1987). 

We fit a logistic-regression model by maximum likelihood, giving the 
results shown in Table 4.2. This summary includes Z scores for each of the 
coefficients in the model (coefficients divided by their standard errors); a 
nonsignificant Z score suggests a coefficient can be dropped from the model. 
Each of these corresponds formally to a test of the null hypothesis that the
coefficient in question is zero, while all the others are not (also known as
the Wald test). A Z score greater than approximately 2 in absolute value
is significant at the 5% level.
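As a small illustration of the Wald Z scores, the ratios can be recomputed from the rounded entries of Table 4.2. Because the printed coefficients and standard errors are rounded to three decimals, the recomputed values differ slightly from the Z Score column. The helper names `wald_z` and `two_sided_p` are made up for this sketch.

```python
import math

# (coefficient, standard error) pairs copied from Table 4.2
table_4_2 = {
    "(Intercept)": (-4.130, 0.964),
    "sbp": (0.006, 0.006),
    "tobacco": (0.080, 0.026),
    "ldl": (0.185, 0.057),
    "famhist": (0.939, 0.225),
    "obesity": (-0.035, 0.029),
    "alcohol": (0.001, 0.004),
    "age": (0.043, 0.010),
}

def wald_z(coef, se):
    """Wald Z score: the coefficient divided by its standard error."""
    return coef / se

def two_sided_p(z):
    """Two-sided standard normal tail probability P(|Z| > |z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

z_scores = {name: wald_z(c, s) for name, (c, s) in table_4_2.items()}
# Rule of thumb from the text: |Z| > 2 is significant at roughly the 5% level.
significant = [name for name, z in z_scores.items() if abs(z) > 2.0]

for name, z in z_scores.items():
    print(f"{name:12s} z = {z:7.3f}  p = {two_sided_p(z):.4f}")
print("significant:", significant)
```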

There are some surprises in this table of coefficients, which must be
interpreted with caution. Systolic blood pressure (sbp) is not significant! Nor
is obesity, and its sign is negative. This confusion is a result of the
correlation between the set of predictors. On their own, both sbp and obesity
are significant, and with positive sign. However, in the presence of many
other correlated variables, they are no longer needed (and can even get a
negative sign).

FIGURE 4.12. A scatterplot matrix of the South African heart disease data.
Each plot shows a pair of risk factors, and the cases and controls are color coded
(red is a case). The variable family history of heart disease (famhist) is binary
(yes or no).

TABLE 4.3. Results from stepwise logistic regression fit to South African heart
disease data.

               Coefficient   Std. Error   Z score
(Intercept)         -4.204        0.498     -8.45
tobacco              0.081        0.026      3.16
ldl                  0.168        0.054      3.09
famhist              0.924        0.223      4.14
age                  0.044        0.010      4.52

At this stage the analyst might do some model selection: find a subset
of the variables that is sufficient for explaining their joint effect on the
prevalence of chd. One way to proceed is to drop the least significant
coefficient, and refit the model. This is done repeatedly until no further terms
can be dropped from the model. This gave the model shown in Table 4.3.

A better but more time-consuming strategy is to refit each of the models 
with one variable removed, and then perform an analysis of deviance to 
decide which variable to exclude. The residual deviance of a fitted model 
is minus twice its log-likelihood, and the deviance between two models is 
the difference of their individual residual deviances (in analogy to
sums-of-squares). This strategy gave the same final model as above.

How does one interpret a coefficient of 0.081 (Std. Error = 0.026) for
tobacco, for example? Tobacco is measured in total lifetime usage in
kilograms, with a median of 1.0 kg for the controls and 4.1 kg for the cases. Thus
an increase of 1 kg in lifetime tobacco usage accounts for an increase in the
odds of coronary heart disease of $\exp(0.081) = 1.084$ or 8.4%. Incorporating
the standard error we get an approximate 95% confidence interval of
$\exp(0.081 \pm 2 \times 0.026) = (1.03, 1.14)$.
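The arithmetic behind the odds ratio and its approximate interval is short enough to check directly. The sketch below uses only the tobacco coefficient and standard error quoted in the text; the variable names are chosen for this illustration.

```python
import math

coef, se = 0.081, 0.026  # tobacco row of Table 4.3

# A 1 kg increase multiplies the odds by exp(coef).
odds_ratio = math.exp(coef)

# Approximate 95% interval: exponentiate coef +/- 2 standard errors.
ci_low = math.exp(coef - 2 * se)
ci_high = math.exp(coef + 2 * se)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is computed on the log-odds scale and then exponentiated, which is why it is not symmetric about 1.084.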

We return to these data in Chapter 5, where we see that some of the 
variables have nonlinear effects, and when modeled appropriately, are not 
excluded from the model. 


4.4.3 Quadratic Approximations and Inference

The maximum-likelihood parameter estimates $\hat\beta$ satisfy a self-consistency
relationship: they are the coefficients of a weighted least squares fit, where
the responses are

$$z_i = x_i^T\hat\beta + \frac{(y_i - \hat p_i)}{\hat p_i(1 - \hat p_i)}, \tag{4.29}$$

and the weights are $w_i = \hat p_i(1 - \hat p_i)$, both depending on $\hat\beta$ itself. Apart from
providing a convenient algorithm, this connection with least squares has
more to offer:

• The weighted residual sum-of-squares is the familiar Pearson chi-square
statistic

$$\sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}, \tag{4.30}$$

a quadratic approximation to the deviance.


• Asymptotic likelihood theory says that if the model is correct, then
$\hat\beta$ is consistent (i.e., converges to the true $\beta$).

• A central limit theorem then shows that the distribution of $\hat\beta$
converges to $N(\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1})$. This and other asymptotics can be
derived directly from the weighted least squares fit by mimicking normal
theory inference.


• Model building can be costly for logistic regression models, because
each model fitted requires iteration. Popular shortcuts are the Rao
score test, which tests for inclusion of a term, and the Wald test, which
can be used to test for exclusion of a term. Neither of these requires
iterative fitting; both are based on the maximum-likelihood fit of the
current model. It turns out that both of these amount to adding
or dropping a term from the weighted least squares fit, using the
same weights. Such computations can be done efficiently, without
recomputing the entire weighted least squares fit.


Software implementations can take advantage of these connections. For
example, the generalized linear modeling software in R (which includes
logistic regression as part of the binomial family of models) exploits them
fully. GLM (generalized linear model) objects can be treated as linear model
objects, and all the tools available for linear models can be applied
automatically.


4.4.4 $L_1$ Regularized Logistic Regression

The $L_1$ penalty used in the lasso (Section 3.4.2) can be used for variable
selection and shrinkage with any linear regression model. For logistic
regression, we would maximize a penalized version of (4.20):

$$\max_{\beta_0,\beta} \left\{ \sum_{i=1}^N \left[ y_i(\beta_0 + \beta^T x_i) - \log\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^p |\beta_j| \right\}. \tag{4.31}$$

As with the lasso, we typically do not penalize the intercept term, and
standardize the predictors for the penalty to be meaningful. Criterion (4.31) is
concave, and a solution can be found using nonlinear programming methods
(Koh et al., 2007, for example). Alternatively, using the same quadratic
approximations that were used in the Newton algorithm in Section 4.4.1,
we can solve (4.31) by repeated application of a weighted lasso algorithm.
Interestingly, the score equations [see (4.24)] for the variables with non-zero
coefficients have the form

$$\mathbf{x}_j^T(\mathbf{y} - \mathbf{p}) = \lambda \cdot \operatorname{sign}(\beta_j), \tag{4.32}$$

which generalizes (3.58) in Section 3.4.4; the active variables are tied in
their generalized correlation with the residuals.

Path algorithms such as LAR for lasso are more difficult, because the
coefficient profiles are piecewise smooth rather than linear. Nevertheless,
progress can be made using quadratic approximations.
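The stationarity condition (4.32) can be checked numerically on a small problem. The sketch below maximizes (4.31) with plain proximal gradient steps (gradient ascent on the log-likelihood followed by soft-thresholding), which is a simpler stand-in chosen for this illustration and not one of the algorithms cited above. The function names, the toy data, and the value of `lam` are assumptions.

```python
import math

def soft_threshold(u, t):
    """Soft-thresholding operator: sign(u) * max(|u| - t, 0)."""
    if u > t:
        return u - t
    if u < -t:
        return u + t
    return 0.0

def fitted_probs(beta0, beta, X):
    return [1.0 / (1.0 + math.exp(-(beta0 + sum(b * v for b, v in zip(beta, row)))))
            for row in X]

def score(beta0, beta, X, y):
    """Return (sum_i (y_i - p_i), [x_j^T (y - p) for each predictor j])."""
    p = fitted_probs(beta0, beta, X)
    r = [yi - pi for yi, pi in zip(y, p)]
    return sum(r), [sum(row[j] * ri for row, ri in zip(X, r)) for j in range(len(beta))]

def l1_logistic(X, y, lam, n_iter=20000):
    """Maximize criterion (4.31) by proximal gradient; the intercept is not penalized."""
    d = len(X[0])
    # Step size 1/L, bounding the curvature L by (1/4) * sum_i ||(1, x_i)||^2.
    step = 4.0 / sum(1.0 + sum(v * v for v in row) for row in X)
    beta0, beta = 0.0, [0.0] * d
    for _ in range(n_iter):
        g0, g = score(beta0, beta, X, y)
        beta0 += step * g0
        beta = [soft_threshold(b + step * gj, step * lam) for b, gj in zip(beta, g)]
    return beta0, beta

# Hypothetical toy data: two centered predictors, overlapping classes.
x1 = [-1.5, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 1.5]
x2 = [0.3, -0.8, 0.5, -0.1, -0.6, 0.9, -0.4, 0.2]
X = [[a, b] for a, b in zip(x1, x2)]
y = [0, 0, 1, 0, 1, 0, 1, 1]
lam = 0.5

beta0, beta = l1_logistic(X, y, lam)
g0, g = score(beta0, beta, X, y)
print("beta0 =", beta0, "beta =", beta)
print("x_j^T (y - p) =", g, " lambda =", lam)
```

For each non-zero coefficient the printed inner product should equal $\lambda \cdot \operatorname{sign}(\beta_j)$ up to numerical tolerance, while for coefficients held at zero it should be no larger than $\lambda$ in absolute value.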

[Figure: coefficient profiles labeled age, famhist, tobacco, sbp, alcohol and obesity; horizontal axis: $L_1$ norm of the coefficients, from 0.0 to 2.0.]

FIGURE 4.13. $L_1$ regularized logistic regression coefficients for the South
African heart disease data, plotted as a function of the $L_1$ norm. The variables
were all standardized to have unit variance. The profiles are computed exactly at
each of the plotted points.

Figure 4.13 shows the $L_1$ regularization path for the South African
heart disease data of Section 4.4.2. This was produced using the R package
glmpath (Park and Hastie, 2007), which uses predictor-corrector methods
of convex optimization to identify the exact values of $\lambda$ at which the active
set of non-zero coefficients changes (vertical lines in the figure). Here the
profiles look almost linear; in other examples the curvature will be more
visible.

Coordinate descent methods (Section 3.8.6) are very efficient for computing
the coefficient profiles on a grid of values for $\lambda$. The R package glmnet
(Friedman et al., 2010) can fit coefficient paths for very large logistic
regression problems efficiently (large in $N$ or $p$). Their algorithms can exploit
sparsity in the predictor matrix $\mathbf{X}$, which allows for even larger problems.
See Section 18.4 for more details, and a discussion of $L_1$-regularized
multinomial models.


4.4.5 Logistic Regression or LDA?


In Section 4.3 we find that the log-posterior odds between class $k$ and $K$
are linear functions of $x$ (4.9):

$$\begin{aligned}
\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
&= \log\frac{\pi_k}{\pi_K} - \frac{1}{2}(\mu_k + \mu_K)^T \Sigma^{-1}(\mu_k - \mu_K) + x^T \Sigma^{-1}(\mu_k - \mu_K) \\
&= \alpha_{k0} + \alpha_k^T x.
\end{aligned} \tag{4.33}$$

This linearity is a consequence of the Gaussian assumption for the class
densities, as well as the assumption of a common covariance matrix. The
linear logistic model (4.17) by construction has linear logits:

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x. \tag{4.34}$$

It seems that the models are the same. Although they have exactly the same
form, the difference lies in the way the linear coefficients are estimated. The
logistic regression model is more general, in that it makes fewer assumptions.
We can write the joint density of $X$ and $G$ as

$$\Pr(X, G = k) = \Pr(X)\Pr(G = k \mid X), \tag{4.35}$$

where $\Pr(X)$ denotes the marginal density of the inputs $X$. For both LDA
and logistic regression, the second term on the right has the logit-linear
form

$$\Pr(G = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^T x}}{1 + \sum_{\ell=1}^{K-1} e^{\beta_{\ell 0} + \beta_\ell^T x}}, \tag{4.36}$$

where we have again arbitrarily chosen the last class as the reference.

The logistic regression model leaves the marginal density of $X$ as an
arbitrary density function $\Pr(X)$, and fits the parameters of $\Pr(G \mid X)$ by
maximizing the conditional likelihood, that is, the multinomial likelihood with
probabilities $\Pr(G = k \mid X)$. Although $\Pr(X)$ is totally ignored, we can think
of this marginal density as being estimated in a fully nonparametric and
unrestricted fashion, using the empirical distribution function which places
mass $1/N$ at each observation.

With LDA we fit the parameters by maximizing the full log-likelihood,
based on the joint density

$$\Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k, \tag{4.37}$$

where $\phi$ is the Gaussian density function. Standard normal theory leads
easily to the estimates $\hat\mu_k$, $\hat\Sigma$, and $\hat\pi_k$ given in Section 4.3. Since the linear
parameters of the logistic form (4.33) are functions of the Gaussian
parameters, we get their maximum-likelihood estimates by plugging in the
corresponding estimates. However, unlike in the conditional case, the marginal
density $\Pr(X)$ does play a role here. It is a mixture density

$$\Pr(X) = \sum_{k=1}^K \pi_k\, \phi(X; \mu_k, \Sigma), \tag{4.38}$$

which also involves the parameters.

What role can this additional component/restriction play? By relying 
on the additional model assumptions, we have more information about the 
parameters, and hence can estimate them more efficiently (lower variance). 
If in fact the true fk(x) are Gaussian, then in the worst case ignoring this 
marginal part of the likelihood constitutes a loss of efficiency of about 30% 
asymptotically in the error rate (Efron, 1975). Paraphrasing: with 30% 
more data, the conditional likelihood will do as well. 

For example, observations far from the decision boundary (which are 
down-weighted by logistic regression) play a role in estimating the common 
covariance matrix. This is not all good news, because it also means that 
LDA is not robust to gross outliers. 

From the mixture formulation, it is clear that even observations without 
class labels have information about the parameters. Often it is expensive 
to generate class labels, but unclassified observations come cheaply. By 
relying on strong model assumptions, such as here, we can use both types 
of information. 

The marginal likelihood can be thought of as a regularizer, requiring
in some sense that class densities be visible from this marginal view. For
example, if the data in a two-class logistic regression model can be
perfectly separated by a hyperplane, the maximum likelihood estimates of the
parameters are undefined (i.e., infinite; see Exercise 4.5). The LDA
coefficients for the same data will be well defined, since the marginal likelihood
will not permit these degeneracies.
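The degeneracy under perfect separation is easy to see numerically. In the sketch below, one-dimensional toy data (made up for this illustration) are separated at $x_0 = 0$; scaling the slope up always increases the log-likelihood (4.20), which approaches but never reaches 0, so no finite maximizer exists.

```python
import math

def log_lik(beta0, beta1, x, y):
    """Two-class logistic log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta0 + beta1 * xi
        total += yi * eta - math.log1p(math.exp(eta))
    return total

# Hypothetical data: class 0 lies left of 0, class 1 lies right of 0.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 1]

scales = [1, 2, 4, 8, 16]
values = [log_lik(0.0, c, x, y) for c in scales]
for c, v in zip(scales, values):
    print(f"slope = {c:2d}  log-likelihood = {v:.6f}")
```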

In practice these assumptions are never correct, and often some of the
components of $X$ are qualitative variables. It is generally felt that logistic
regression is a safer, more robust bet than the LDA model, relying on fewer
assumptions. It is our experience that the models give very similar results,
even when LDA is used inappropriately, such as with qualitative predictors.

FIGURE 4.14. A toy example with two classes separable by a hyperplane. The
orange line is the least squares solution, which misclassifies one of the training
points. Also shown are two blue separating hyperplanes found by the perceptron
learning algorithm with different random starts.


4.5 Separating Hyperplanes 

We have seen that linear discriminant analysis and logistic regression both
estimate linear decision boundaries in similar but slightly different ways.
For the rest of this chapter we describe separating hyperplane classifiers.
These procedures construct linear decision boundaries that explicitly try
to separate the data into different classes as well as possible. They provide
the basis for support vector classifiers, discussed in Chapter 12. The
mathematical level of this section is somewhat higher than that of the previous
sections.

Figure 4.14 shows 20 data points in two classes in $\mathbb{R}^2$. These data can be
separated by a linear boundary. Included in the figure (blue lines) are two
of the infinitely many possible separating hyperplanes. The orange line is
the least squares solution to the problem, obtained by regressing the $-1/1$
response $Y$ on $X$ (with intercept); the line is given by

$$\{x : \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 = 0\}. \tag{4.39}$$

This least squares solution does not do a perfect job in separating the
points, and makes one error. This is the same boundary found by LDA,
in light of its equivalence with linear regression in the two-class case
(Section 4.3 and Exercise 4.2).

Classifiers such as (4.39), that compute a linear combination of the input
features and return the sign, were called perceptrons in the engineering
literature in the late 1950s (Rosenblatt, 1958). Perceptrons set the foundations
for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra.
Figure 4.15 depicts a hyperplane or affine set $L$ defined by the equation
$f(x) = \beta_0 + \beta^T x = 0$; since we are in $\mathbb{R}^2$ this is a line.

Here we list some properties: 

1. For any two points $x_1$ and $x_2$ lying in $L$, $\beta^T(x_1 - x_2) = 0$, and hence
$\beta^* = \beta/\|\beta\|$ is the vector normal to the surface of $L$.

2. For any point $x_0$ in $L$, $\beta^T x_0 = -\beta_0$.

3. The signed distance of any point $x$ to $L$ is given by

$$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}(\beta^T x + \beta_0) = \frac{1}{\|f'(x)\|} f(x). \tag{4.40}$$

Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane
defined by $f(x) = 0$.
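Property 3 translates directly into code. The following sketch evaluates (4.40) for a line in the plane; the function name `signed_distance` and the particular coefficients are chosen only for illustration.

```python
import math

def signed_distance(x, beta, beta0):
    """Signed distance from the point x to the hyperplane {x : beta0 + beta^T x = 0}."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * v for b, v in zip(beta, x)) + beta0) / norm

# Hypothetical line 3*x1 + 4*x2 - 5 = 0, so ||beta|| = 5.
beta, beta0 = [3.0, 4.0], -5.0

print(signed_distance([3.0, 4.0], beta, beta0))   # point on the positive side
print(signed_distance([0.0, 0.0], beta, beta0))   # the origin, on the negative side
print(signed_distance([1.0, 0.5], beta, beta0))   # a point lying on the line
```

The sign tells on which side of the hyperplane a point falls, which is exactly what a classifier of the form (4.39) reports.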

4.5.1 Rosenblatt’s Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by
minimizing the distance of misclassified points to the decision boundary. If
a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite for
a misclassified response with $y_i = -1$. The goal is to minimize

$$D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(x_i^T\beta + \beta_0), \tag{4.41}$$

where $\mathcal{M}$ indexes the set of misclassified points. The quantity is
non-negative and proportional to the distance of the misclassified points to
the decision boundary defined by $\beta^T x + \beta_0 = 0$. The gradient (assuming
$\mathcal{M}$ is fixed) is given by

$$\frac{\partial D(\beta, \beta_0)}{\partial\beta} = -\sum_{i \in \mathcal{M}} y_i x_i, \tag{4.42}$$

$$\frac{\partial D(\beta, \beta_0)}{\partial\beta_0} = -\sum_{i \in \mathcal{M}} y_i. \tag{4.43}$$

The algorithm in fact uses stochastic gradient descent to minimize this
piecewise linear criterion. This means that rather than computing the sum
of the gradient contributions of each observation followed by a step in the
negative gradient direction, a step is taken after each observation is visited.
Hence the misclassified observations are visited in some sequence, and the
parameters $\beta$ are updated via

$$\begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} \leftarrow \begin{pmatrix} \beta \\ \beta_0 \end{pmatrix} + \rho \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix}. \tag{4.44}$$

Here $\rho$ is the learning rate, which in this case can be taken to be 1 without
loss in generality. If the classes are linearly separable, it can be shown that
the algorithm converges to a separating hyperplane in a finite number of
steps (Exercise 4.6). Figure 4.14 shows two solutions to a toy problem, each
started at a different random guess.
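A minimal sketch of the perceptron update follows. It cycles through the observations and takes a step of size $\rho$ in the direction $(y_i x_i, y_i)$ whenever observation $i$ is misclassified. The function name `perceptron`, the epoch cap, the random start, and the toy data are assumptions for this illustration; points lying exactly on the boundary are treated as misclassified.

```python
import random

def perceptron(X, y, rho=1.0, max_epochs=1000, seed=0):
    """Perceptron learning for labels in {-1, +1}.

    Returns (beta, beta0, converged). converged is True once a full pass
    over the data produces no update, i.e. a separating hyperplane was found.
    """
    rng = random.Random(seed)
    beta = [rng.uniform(-1.0, 1.0) for _ in range(len(X[0]))]  # random starting guess
    beta0 = rng.uniform(-1.0, 1.0)
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            f = sum(b * v for b, v in zip(beta, xi)) + beta0
            if yi * f <= 0:  # misclassified (or on the boundary)
                beta = [b + rho * yi * v for b, v in zip(beta, xi)]
                beta0 += rho * yi
                updated = True
        if not updated:
            return beta, beta0, True
    return beta, beta0, False

# Hypothetical linearly separable toy data in the plane.
X = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-1.0, -1.0), (-2.0, 0.0), (0.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]

beta, beta0, converged = perceptron(X, y)
print(beta, beta0, converged)
```

Changing `seed` changes the starting guess and, in general, the separating hyperplane that is returned, which is the first of the problems listed next.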

There are a number of problems with this algorithm, summarized in 
Ripley (1996): 

• When the data are separable, there are many solutions, and which 
one is found depends on the starting values. 

• The “finite” number of steps can be very large. The smaller the gap, 
the longer the time to find it. 

• When the data are not separable, the algorithm will not converge, 
and cycles develop. The cycles can be long and therefore hard to 
detect. 


The second problem can often be eliminated by seeking a hyperplane not
in the original space, but in a much enlarged space obtained by creating
many basis-function transformations of the original variables. This is
analogous to driving the residuals in a polynomial regression problem down
to zero by making the degree sufficiently large. Perfect separation cannot
always be achieved: for example, if observations from two different classes
share the same input. It may not be desirable either, since the resulting
model is likely to be overfit and will not generalize well. We return to this
point at the end of the next section.

A rather elegant solution to the first problem is to add additional
constraints to the separating hyperplane.

4.5.2 Optimal Separating Hyperplanes

The optimal separating hyperplane separates the two classes and maximizes 
the distance to the closest point from either class (Vapnik, 1996). Not only 
does this provide a unique solution to the separating hyperplane problem, 
but by maximizing the margin between the two classes on the training data, 
this leads to better classification performance on test data. 

We need to generalize criterion (4.41). Consider the optimization problem

$$\max_{\beta,\,\beta_0,\,\|\beta\| = 1} M \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge M,\ i = 1, \ldots, N. \tag{4.45}$$

The set of conditions ensures that all the points are at least a signed
distance $M$ from the decision boundary defined by $\beta$ and $\beta_0$, and we seek
the largest such $M$ and associated parameters. We can get rid of the
$\|\beta\| = 1$ constraint by replacing the conditions with

$$\frac{1}{\|\beta\|}\, y_i(x_i^T\beta + \beta_0) \ge M \tag{4.46}$$

(which redefines $\beta_0$) or equivalently

$$y_i(x_i^T\beta + \beta_0) \ge M\|\beta\|. \tag{4.47}$$

Since for any $\beta$ and $\beta_0$ satisfying these inequalities, any positively scaled
multiple satisfies them too, we can arbitrarily set $\|\beta\| = 1/M$. Thus (4.45)
is equivalent to

$$\min_{\beta,\,\beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1,\ i = 1, \ldots, N. \tag{4.48}$$

In light of (4.40), the constraints define an empty slab or margin around the
linear decision boundary of thickness $1/\|\beta\|$. Hence we choose $\beta$ and $\beta_0$ to
maximize its thickness. This is a convex optimization problem (quadratic
criterion with linear inequality constraints). The Lagrange (primal)
function, to be minimized w.r.t. $\beta$ and $\beta_0$, is

$$L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right]. \tag{4.49}$$

Setting the derivatives to zero, we obtain:

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{4.50}$$

$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{4.51}$$

and substituting these in (4.49) we obtain the so-called Wolfe dual

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i=1}^N \alpha_i y_i = 0. \tag{4.52}$$

The solution is obtained by maximizing $L_D$ in the positive orthant, a
simpler convex optimization problem, for which standard software can be used.
In addition the solution must satisfy the Karush-Kuhn-Tucker conditions,
which include (4.50), (4.51), (4.52) and

$$\alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right] = 0 \quad \forall i. \tag{4.53}$$

From these we can see that

• if $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$, or in other words, $x_i$ is on the
boundary of the slab;

• if $y_i(x_i^T\beta + \beta_0) > 1$, $x_i$ is not on the boundary of the slab, and $\alpha_i = 0$.

From (4.50) we see that the solution vector $\beta$ is defined in terms of a linear
combination of the support points $x_i$, those points defined to be on the
boundary of the slab via $\alpha_i > 0$. Figure 4.16 shows the optimal separating
hyperplane for our toy example; there are three support points. Likewise,
$\beta_0$ is obtained by solving (4.53) for any of the support points.
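The substitution that takes the primal function (4.49) to the Wolfe dual can be spelled out in a few lines. This is only a sketch of the algebra; it uses nothing beyond the stationarity conditions (4.50) and (4.51).

```latex
\begin{align*}
L_P &= \tfrac{1}{2}\|\beta\|^2
       - \Big(\sum_{i=1}^N \alpha_i y_i x_i\Big)^T \beta
       - \beta_0 \sum_{i=1}^N \alpha_i y_i
       + \sum_{i=1}^N \alpha_i
       && \text{expanding (4.49)} \\
    &= \tfrac{1}{2}\|\beta\|^2 - \beta^T\beta - 0 + \sum_{i=1}^N \alpha_i
       && \text{by (4.50) and (4.51)} \\
    &= \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\|\beta\|^2 \\
    &= \sum_{i=1}^N \alpha_i
       - \tfrac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i \alpha_k y_i y_k\, x_i^T x_k
       \;=\; L_D
       && \text{by (4.50) again.}
\end{align*}
```

The dual depends on the inputs only through the inner products $x_i^T x_k$, and $\beta_0$ has dropped out entirely, which is why it must be recovered afterwards from (4.53).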

The optimal separating hyperplane produces a function $\hat f(x) = x^T\hat\beta + \hat\beta_0$
for classifying new observations:

$$\hat G(x) = \operatorname{sign} \hat f(x). \tag{4.54}$$

Although none of the training observations fall in the margin (by
construction), this will not necessarily be the case for test observations. The
intuition is that a large margin on the training data will lead to good
separation on the test data.

FIGURE 4.16. The same data as in Figure 4.14. The shaded region delineates
the maximum margin separating the two classes. There are three support points
indicated, which lie on the boundary of the margin, and the optimal separating
hyperplane (blue line) bisects the slab. Included in the figure is the boundary found
using logistic regression (red line), which is very close to the optimal separating
hyperplane (see Section 12.3.3).

The description of the solution in terms of support points seems to
suggest that the optimal hyperplane focuses more on the points that count,
and is more robust to model misspecification. The LDA solution, on the
other hand, depends on all of the data, even points far away from the
decision boundary. Note, however, that the identification of these support
points required the use of all the data. Of course, if the classes are really
Gaussian, then LDA is optimal, and separating hyperplanes will pay a price
for focusing on the (noisier) data at the boundaries of the classes.

Included in Figure 4.16 is the logistic regression solution to this
problem, fit by maximum likelihood. Both solutions are similar in this case.
When a separating hyperplane exists, logistic regression will always find 
it, since the log-likelihood can be driven to 0 in this case (Exercise 4.5). 
The logistic regression solution shares some other qualitative features with 
the separating hyperplane solution. The coefficient vector is defined by a 
weighted least squares fit of a zero-mean linearized response on the input 
features, and the weights are larger for points near the decision boundary 
than for those further away. 

When the data are not separable, there will be no feasible solution to
this problem, and an alternative formulation is needed. Again one can
enlarge the space using basis transformations, but this can lead to artificial
separation through over-fitting. In Chapter 12 we discuss a more attractive
alternative known as the support vector machine, which allows for overlap,
but minimizes a measure of the extent of this overlap.

Bibliographic Notes 

Good general texts on classification include Duda et al. (2000), Hand 
(1981), McLachlan (1992) and Ripley (1996). Mardia et al. (1979) have 
a concise discussion of linear discriminant analysis. Michie et al. (1994) 
compare a large number of popular classifiers on benchmark datasets. Linear
separating hyperplanes are discussed in Vapnik (1996). Our account of
the perceptron learning algorithm follows Ripley (1996). 


Exercises 


Ex. 4.1 Show how to solve the generalized eigenvalue problem $\max a^T \mathbf{B} a$
subject to $a^T \mathbf{W} a = 1$ by transforming to a standard eigenvalue problem.


Ex. 4.2 Suppose we have features $x \in \mathbb{R}^p$, a two-class response, with class
sizes $N_1, N_2$, and the target coded as $-N/N_1, N/N_2$.

(a) Show that the LDA rule classifies to class 2 if

$$x^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \frac{1}{2}(\hat\mu_2 + \hat\mu_1)^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) - \log(N_2/N_1),$$

and class 1 otherwise.

(b) Consider minimization of the least squares criterion

$$\sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2. \tag{4.55}$$

Show that the solution $\hat\beta$ satisfies

$$\left[(N - 2)\hat\Sigma + N\hat\Sigma_B\right]\beta = N(\hat\mu_2 - \hat\mu_1) \tag{4.56}$$

(after simplification), where $\hat\Sigma_B = \frac{N_1 N_2}{N^2}(\hat\mu_2 - \hat\mu_1)(\hat\mu_2 - \hat\mu_1)^T$.

(c) Hence show that $\hat\Sigma_B \beta$ is in the direction $(\hat\mu_2 - \hat\mu_1)$ and thus

$$\hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1). \tag{4.57}$$

Therefore the least-squares regression coefficient is identical to the
LDA coefficient, up to a scalar multiple.

(d) Show that this result holds for any (distinct) coding of the two classes.

(e) Find the solution $\hat\beta_0$ (up to the same scalar multiple as in (c)), and
hence the predicted value $\hat f(x) = \hat\beta_0 + x^T\hat\beta$. Consider the following
rule: classify to class 2 if $\hat f(x) > 0$ and class 1 otherwise. Show this is
not the same as the LDA rule unless the classes have equal numbers
of observations.

(Fisher, 1936; Ripley, 1996)

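Ex. 4.3 Suppose we transform the original predictors $\mathbf{X}$ to $\hat{\mathbf{Y}}$ via linear
regression. In detail, let $\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{X}\hat{\mathbf{B}}$, where $\mathbf{Y}$ is the
indicator response matrix. Similarly for any input $x \in \mathbb{R}^p$, we get a
transformed vector $\hat y = \hat{\mathbf{B}}^T x \in \mathbb{R}^K$. Show that LDA using $\hat{\mathbf{Y}}$ is identical to
LDA in the original space.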

Ex. 4.4 Consider the multilogit model with $K$ classes (4.17). Let $\beta$ be the
$(p + 1)(K - 1)$-vector consisting of all the coefficients. Define a suitably
enlarged version of the input vector $x$ to accommodate this vectorized
coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the
multinomial log-likelihood, and describe how you would implement this
algorithm.

Ex. 4.5 Consider a two-class logistic regression problem with $x \in \mathbb{R}$.
Characterize the maximum-likelihood estimates of the slope and intercept
parameter if the sample $x_i$ for the two classes are separated by a point $x_0 \in \mathbb{R}$.
Generalize this result to (a) $x \in \mathbb{R}^p$ (see Figure 4.16), and (b) more than
two classes.

Ex. 4.6 Suppose we have $N$ points $x_i$ in $\mathbb{R}^p$ in general position, with class
labels $y_i \in \{-1, 1\}$. Prove that the perceptron learning algorithm converges
to a separating hyperplane in a finite number of steps:

(a) Denote a hyperplane by $f(x) = \beta_1^T x + \beta_0 = 0$, or in more compact
notation $\beta^T x^* = 0$, where $x^* = (x, 1)$ and $\beta = (\beta_1, \beta_0)$. Let
$z_i = x_i^*/\|x_i^*\|$. Show that separability implies the existence of a $\beta_{\text{sep}}$ such
that $y_i \beta_{\text{sep}}^T z_i \ge 1\ \forall i$.

(b) Given a current $\beta_{\text{old}}$, the perceptron algorithm identifies a point $z_i$ that
is misclassified, and produces the update $\beta_{\text{new}} \leftarrow \beta_{\text{old}} + y_i z_i$. Show
that $\|\beta_{\text{new}} - \beta_{\text{sep}}\|^2 \le \|\beta_{\text{old}} - \beta_{\text{sep}}\|^2 - 1$, and hence that the algorithm
converges to a separating hyperplane in no more than $\|\beta_{\text{start}} - \beta_{\text{sep}}\|^2$
steps (Ripley, 1996).

Ex. 4.7 Consider the criterion

$$D^*(\beta, \beta_0) = -\sum_{i=1}^N y_i(x_i^T\beta + \beta_0), \tag{4.58}$$

a generalization of (4.41) where we sum over all the observations. Consider
minimizing $D^*$ subject to $\|\beta\| = 1$. Describe this criterion in words. Does
it solve the optimal separating hyperplane problem?

Ex. 4.8 Consider the multivariate Gaussian model $X \mid G = k \sim N(\mu_k, \Sigma)$,
with the additional restriction that $\operatorname{rank}\{\mu_k\}_1^K = L < \max(K - 1, p)$.
Derive the constrained MLEs for the $\mu_k$ and $\Sigma$. Show that the Bayes
classification rule is equivalent to classifying in the reduced subspace computed
by LDA (Hastie and Tibshirani, 1996b).

Ex. 4.9 Write a computer program to perform a quadratic discriminant
analysis by fitting a separate Gaussian model per class. Try it out on the
vowel data, and compute the misclassification error for the test data. The
data can be found on the book website www-stat.stanford.edu/ElemStatLearn.


Basis Expansions and Regularization


5.1 Introduction 

We have already made use of models linear in the input features, both for
regression and classification. Linear regression, linear discriminant analysis,
logistic regression and separating hyperplanes all rely on a linear model.
It is extremely unlikely that the true function $f(X)$ is actually linear in
$X$. In regression problems, $f(X) = \mathrm{E}(Y \mid X)$ will typically be nonlinear and
nonadditive in $X$, and representing $f(X)$ by a linear model is usually a
convenient, and sometimes a necessary, approximation. Convenient because a
linear model is easy to interpret, and is the first-order Taylor approximation
to $f(X)$. Sometimes necessary, because with $N$ small and/or $p$ large,
a linear model might be all we are able to fit to the data without overfitting.
Likewise in classification, a linear, Bayes-optimal decision boundary
implies that some monotone transformation of $\Pr(Y = 1 \mid X)$ is linear in $X$.
This is inevitably an approximation.

In this chapter and the next we discuss popular methods for moving 
beyond linearity. The core idea in this chapter is to augment/replace the 
vector of inputs X with additional variables, which are transformations of 
X , and then use linear models in this new space of derived input features. 

Denote by $h_m(X) : \mathbb{R}^p \mapsto \mathbb{R}$ the $m$th transformation of $X$,
$m = 1, \ldots, M$. We then model

$$f(X) = \sum_{m=1}^M \beta_m h_m(X), \tag{5.1}$$

a linear basis expansion in $X$. The beauty of this approach is that once the
basis functions $h_m$ have been determined, the models are linear in these
new variables, and the fitting proceeds as before.

Some simple and widely used examples of the $h_m$ are the following:

• $h_m(X) = X_m$, $m = 1, \ldots, p$ recovers the original linear model.

• $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$ allows us to augment the inputs with
polynomial terms to achieve higher-order Taylor expansions. Note,
however, that the number of variables grows exponentially in the
degree of the polynomial. A full quadratic model in p variables requires
$O(p^2)$ square and cross-product terms, or more generally $O(p^d)$ for a
degree-d polynomial.

• $h_m(X) = \log(X_j), \sqrt{X_j}, \ldots$ permits other nonlinear
transformations of single inputs. More generally one can use similar
functions involving several inputs, such as $h_m(X) = \|X\|$.

• $h_m(X) = I(L_m \le X_k < U_m)$, an indicator for a region of $X_k$.
Breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions
results in a model with a piecewise constant contribution for $X_k$.
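
The examples above can be tried out in a few lines. The following is a minimal sketch in Python with NumPy, not part of the original text: the function name `expand`, the two-input setup, and the simulated data are all assumptions made for illustration. It builds a handful of derived features and then fits an ordinary least squares model in the new feature space.

```python
import numpy as np

def expand(X):
    """Map an (N, 2) input matrix to derived features h_m(X) (hypothetical helper)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones(len(X)),   # constant term
        x1, x2,            # h_m(X) = X_m: the original linear terms
        x1**2, x1 * x2,    # polynomial terms X_j^2 and X_j X_k
        np.log(x2),        # nonlinear transform of a single input
    ])

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=(200, 2))          # positive inputs so log is defined
beta_true = np.array([0.5, 1.0, -2.0, 0.3, 0.7, 1.5])
y = expand(X) @ beta_true                         # noiseless response, for illustration

H = expand(X)                                     # N x M matrix of derived features
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)  # least squares in the new space
```

Because the model is linear in the derived features, the fit recovers `beta_true` even though the response is nonlinear in the raw inputs.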

Sometimes the problem at hand will call for particular basis functions h m , 
such as logarithms or power functions. More often, however, we use the basis 
expansions as a device to achieve more flexible representations for f(X). 
Polynomials are an example of the latter, although they are limited by 
their global nature—tweaking the coefficients to achieve a functional form 
in one region can cause the function to flap about madly in remote regions. 
In this chapter we consider more useful families of piecewise-polynomials 
and splines that allow for local polynomial representations. We also discuss 
the wavelet bases, especially useful for modeling signals and images. These 
methods produce a dictionary $\mathcal{D}$ consisting of typically a very large
number $|\mathcal{D}|$ of basis functions, far more than we can afford to fit to our data. Along
with the dictionary we require a method for controlling the complexity 
of our model, using basis functions from the dictionary. There are three 
common approaches: 

• Restriction methods, where we decide beforehand to limit the class
of functions. Additivity is an example, where we assume that our
model has the form

$$f(X) = \sum_{j=1}^{p} f_j(X_j) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j). \tag{5.2}$$

The size of the model is limited by the number of basis functions $M_j$
used for each component function $f_j$.

• Selection methods, which adaptively scan the dictionary and include
only those basis functions $h_m$ that contribute significantly to the fit of
the model. Here the variable selection techniques discussed in Chapter 3
are useful. The stagewise greedy approaches such as CART,
MARS and boosting fall into this category as well.

• Regularization methods, where we use the entire dictionary but
restrict the coefficients. Ridge regression is a simple example of a
regularization approach, while the lasso is both a regularization and
selection method. Here we discuss these and more sophisticated methods
for regularization.

5.2 Piecewise Polynomials and Splines 

We assume until Section 5.7 that X is one-dimensional. A piecewise
polynomial function f(X) is obtained by dividing the domain of X into
contiguous intervals, and representing f by a separate polynomial in each
interval. Figure 5.1 shows two simple piecewise polynomials. The first is
piecewise constant, with three basis functions:

$$h_1(X) = I(X < \xi_1), \quad h_2(X) = I(\xi_1 \le X < \xi_2), \quad h_3(X) = I(\xi_2 \le X).$$

Since these are positive over disjoint regions, the least squares estimate of
the model $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$ amounts to
$\hat\beta_m = \bar{Y}_m$, the mean of Y in the mth region.
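
The claim that least squares on the indicator basis returns the region means can be checked numerically. The sketch below is illustrative only: the knot positions, sample size, and simulated response are assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(4 * x) + rng.normal(scale=0.2, size=60)
xi1, xi2 = 1 / 3, 2 / 3   # hypothetical knot positions

# Indicator basis h_1, h_2, h_3 for the three disjoint regions
H = np.column_stack([x < xi1, (xi1 <= x) & (x < xi2), xi2 <= x]).astype(float)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

# Region means of y, computed directly for comparison
region_means = np.array([y[H[:, m] == 1].mean() for m in range(3)])
```

The columns of `H` are orthogonal because the regions are disjoint, which is why each coefficient depends only on the responses in its own region.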

The top right panel shows a piecewise linear fit. Three additional basis
functions are needed: $h_{m+3} = h_m(X)X$, $m = 1, \ldots, 3$. Except in special
cases, we would typically prefer the third panel, which is also piecewise
linear, but restricted to be continuous at the two knots. These continuity
restrictions lead to linear constraints on the parameters; for example,
$f(\xi_1^-) = f(\xi_1^+)$ implies that $\beta_1 + \xi_1\beta_4 = \beta_2 + \xi_1\beta_5$.
In this case, since there are two restrictions, we expect to get back two
parameters, leaving four free parameters.

A more direct way to proceed in this case is to use a basis that
incorporates the constraints:

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+,$$

where $t_+$ denotes the positive part. The function $h_3$ is shown in the lower
right panel of Figure 5.1. We often prefer smoother functions, and these 
can be achieved by increasing the order of the local polynomial. Figure 5.2 
shows a series of piecewise-cubic polynomials fit to the same data, with 

[Figure 5.1: four panels titled "Piecewise Constant", "Piecewise Linear",
"Continuous Piecewise Linear" and "Piecewise-linear Basis Function", with the
knots $\xi_1$ and $\xi_2$ marked on the horizontal axis.]

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some
artificial data. The broken vertical lines indicate the positions of the two knots
$\xi_1$ and $\xi_2$. The blue curve represents the true function, from which the
data were generated with Gaussian noise. The remaining two panels show piecewise
linear functions fit to the same data: the top right unrestricted, and the lower
left restricted to be continuous at the knots. The lower right panel shows a
piecewise-linear basis function, $h_3(X) = (X - \xi_1)_+$, continuous at $\xi_1$.
The black points indicate the sample evaluations $h_3(x_i)$, $i = 1, \ldots, N$.

[Figure 5.2: "Piecewise Cubic Polynomials", four panels titled "Discontinuous",
"Continuous", "Continuous First Derivative" and "Continuous Second Derivative",
with the knots $\xi_1$ and $\xi_2$ marked on the horizontal axis.]

FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of
continuity.

increasing orders of continuity at the knots. The function in the lower 
right panel is continuous, and has continuous first and second derivatives 
at the knots. It is known as a cubic spline. Enforcing one more order of 
continuity would lead to a global cubic polynomial. It is not hard to show 
(Exercise 5.1) that the following basis represents a cubic spline with knots 
at $\xi_1$ and $\xi_2$:

$$\begin{aligned}
h_1(X) &= 1, & h_3(X) &= X^2, & h_5(X) &= (X - \xi_1)^3_+, \\
h_2(X) &= X, & h_4(X) &= X^3, & h_6(X) &= (X - \xi_2)^3_+.
\end{aligned} \tag{5.3}$$


There are six basis functions corresponding to a six-dimensional linear space
of functions. A quick check confirms the parameter count: (3 regions) × (4
parameters per region) − (2 knots) × (3 constraints per knot) = 6.
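
The parameter count can also be confirmed numerically. The sketch below, in Python with NumPy, builds the truncated power basis for two knots and checks that it has six linearly independent columns; the helper name `truncated_power_basis`, the knot positions, and the evaluation grid are illustrative assumptions.

```python
import numpy as np

def truncated_power_basis(x, knots, order=4):
    """Columns x^0 .. x^(order-1), then (x - knot)_+^(order-1) per knot.

    Intended for order >= 2; order 1 would need indicator functions instead.
    """
    x = np.asarray(x, dtype=float)
    poly = [x**j for j in range(order)]
    trunc = [np.maximum(x - k, 0.0)**(order - 1) for k in knots]
    return np.column_stack(poly + trunc)

knots = [1 / 3, 2 / 3]
x = np.linspace(0.0, 1.0, 50)
H = truncated_power_basis(x, knots)          # cubic spline: order 4

n_regions = len(knots) + 1
count = n_regions * 4 - len(knots) * 3       # (3 regions)(4 params) - (2 knots)(3 constraints)
rank = np.linalg.matrix_rank(H)
```

Both `count` and `rank` equal the number of columns of `H`, matching the six-dimensional space described above.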
More generally, an order-M spline with knots $\xi_j$, $j = 1, \ldots, K$ is a
piecewise-polynomial of order M, and has continuous derivatives up to
order M − 2. A cubic spline has M = 4. In fact the piecewise-constant
function in Figure 5.1 is an order-1 spline, while the continuous piecewise
linear function is an order-2 spline. Likewise the general form for the
truncated-power basis set would be

$$h_j(X) = X^{j-1}, \quad j = 1, \ldots, M,$$

$$h_{M+\ell}(X) = (X - \xi_\ell)^{M-1}_+, \quad \ell = 1, \ldots, K.$$

It is claimed that cubic splines are the lowest-order spline for which the 
knot-discontinuity is not visible to the human eye. There is seldom any 
good reason to go beyond cubic-splines, unless one is interested in smooth 
derivatives. In practice the most widely used orders are M = 1,2 and 4. 

These fixed-knot splines are also known as regression splines. One needs
to select the order of the spline, the number of knots and their placement.
One simple approach is to parameterize a family of splines by the number
of basis functions or degrees of freedom, and have the observations $x_i$
determine the positions of the knots. For example, the expression bs(x,df=7)
in R generates a basis matrix of cubic-spline functions evaluated at the N
observations in x, with the 7 − 3 = 4¹ interior knots at the appropriate
percentiles of x (20th, 40th, 60th and 80th). One can be more explicit,
however; bs(x, degree=1, knots = c(0.2, 0.4, 0.6)) generates a basis for
linear splines, with three interior knots, and returns an N × 4 matrix.
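
The knot arithmetic in this paragraph can be mirrored in a short sketch. The Python code below follows the description in the text only; it is not a port of R's bs(), and the helper names `interior_knots_from_df` and `n_columns` are hypothetical. It assumes the constant term is omitted, so that the number of basis columns equals the polynomial degree plus the number of interior knots.

```python
import numpy as np

def interior_knots_from_df(x, df, degree=3):
    """Interior knots at evenly spaced percentiles of x (no intercept column assumed)."""
    n_knots = df - degree
    probs = np.linspace(0, 100, n_knots + 2)[1:-1]   # drop the 0th and 100th percentiles
    return probs, np.percentile(x, probs)

def n_columns(degree, n_knots):
    """Number of basis columns when the constant term is omitted."""
    return degree + n_knots

rng = np.random.default_rng(2)
x = rng.uniform(size=500)
probs, knots = interior_knots_from_df(x, df=7)   # cubic splines with 7 df
linear_cols = n_columns(degree=1, n_knots=3)     # the linear-spline example
```

With `df=7` and cubic degree this places four knots at the 20th, 40th, 60th and 80th percentiles, and the linear spline with three knots has four columns.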

Since the space of spline functions of a particular order and knot sequence 
is a vector space, there are many equivalent bases for representing them 
(just as there are for ordinary polynomials.) While the truncated power 
basis is conceptually simple, it is not too attractive numerically: powers of 
large numbers can lead to severe rounding problems. The B-spline basis, 
described in the Appendix to this chapter, allows for efficient computations 
even when the number of knots K is large. 


5.2.1 Natural Cubic Splines 

We know that the behavior of polynomials fit to data tends to be erratic 
near the boundaries, and extrapolation can be dangerous. These problems 
are exacerbated with splines. The polynomials fit beyond the boundary 
knots behave even more wildly than the corresponding global polynomials 
in that region. This can be conveniently summarized in terms of the pointwise
variance of spline functions fit by least squares (see the example in the
next section for details on these variance calculations). Figure 5.3 compares 


¹ A cubic spline with four knots is eight-dimensional. The bs() function omits by
default the constant term in the basis, since terms like this are typically included with 
other terms in the model. 

[Figure 5.3: pointwise variance curves plotted against X.]

FIGURE 5.3. Pointwise variance curves for four different models, with X
consisting of 50 points drawn at random from U[0,1], and an assumed error model
with constant variance. The linear and cubic polynomial fits have two and four
degrees of freedom, respectively, while the cubic spline and natural cubic spline
each have six degrees of freedom. The cubic spline has two knots at 0.33 and 0.66,
while the natural spline has boundary knots at 0.1 and 0.9, and four interior knots
uniformly spaced between them.


the pointwise variances for a variety of different models. The explosion of 
the variance near the boundaries is clear, and inevitably is worst for cubic 
splines. 
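
For a least squares fit with constant error variance $\sigma^2$, the pointwise variance at $x_0$ is $\sigma^2 h(x_0)^T (\mathbf{H}^T\mathbf{H})^{-1} h(x_0)$. The sketch below evaluates this for three of the models in Figure 5.3 using freshly simulated inputs, so the numbers will not match the figure; the helper names and the random seed are assumptions made for illustration.

```python
import numpy as np

def pointwise_variance(H_train, H_eval, sigma2=1.0):
    """sigma^2 * h(x0)^T (H^T H)^{-1} h(x0) for each row h(x0) of H_eval."""
    G_inv = np.linalg.inv(H_train.T @ H_train)
    return sigma2 * np.einsum("ij,jk,ik->i", H_eval, G_inv, H_eval)

def poly_basis(x, degree):
    return np.column_stack([x**j for j in range(degree + 1)])

def cubic_spline_basis(x, knots):
    trunc = [np.maximum(x - k, 0.0)**3 for k in knots]
    return np.column_stack([poly_basis(x, 3)] + trunc)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(size=50))       # 50 points from U[0, 1]
grid = np.linspace(0.0, 1.0, 101)       # evaluation points

var_linear = pointwise_variance(poly_basis(x, 1), poly_basis(grid, 1))
var_cubic = pointwise_variance(poly_basis(x, 3), poly_basis(grid, 3))
var_spline = pointwise_variance(cubic_spline_basis(x, [0.33, 0.66]),
                                cubic_spline_basis(grid, [0.33, 0.66]))
```

Since these three models are nested, the variance can only grow as basis functions are added, and the growth is largest near the ends of the interval.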

A natural cubic spline adds additional constraints, namely that the
function is linear beyond the boundary knots. This frees up four degrees of
freedom (two constraints each in both boundary regions), which can be
spent more profitably by sprinkling more knots in the interior region. This
tradeoff is illustrated in terms of variance in Figure 5.3. There will be a
price paid in bias near the boundaries, but assuming the function is
linear near the boundaries (where we have less information anyway) is often
considered reasonable.

A natural cubic spline with K knots is represented by K basis functions.
One can start from a basis for cubic splines, and derive the reduced
basis by imposing the boundary constraints. For example, starting from the
truncated power series basis described in Section 5.2, we arrive at
(Exercise 5.4):

$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X), \tag{5.4}$$

where

$$d_k(X) = \frac{(X - \xi_k)^3_+ - (X - \xi_K)^3_+}{\xi_K - \xi_k}. \tag{5.5}$$

Each of these basis functions can be seen to have zero second and third
derivative for $X \ge \xi_K$.
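
The basis in (5.4) and (5.5) is short enough to code directly. The following Python sketch is an illustration under assumed knot positions; the function name `natural_cubic_basis` is hypothetical. It also checks the linearity beyond the last knot by taking second differences on an equally spaced grid.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """N_1 = 1, N_2 = x, N_{k+2} = d_k - d_{K-1} for k = 1..K-2, per (5.4)-(5.5)."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)

    def d(k):  # k is a 0-based knot index here
        num = np.maximum(x - xi[k], 0.0)**3 - np.maximum(x - xi[K - 1], 0.0)**3
        return num / (xi[K - 1] - xi[k])

    cols = [np.ones_like(x), x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

knots = np.linspace(0.1, 0.9, 6)                      # assumed knot positions
B = natural_cubic_basis(np.linspace(0.0, 1.0, 200), knots)

# Beyond the last knot each column should be linear: second differences vanish
x_right = np.linspace(0.95, 1.5, 12)
second_diff = np.diff(natural_cubic_basis(x_right, knots), n=2, axis=0)
```

With six knots the basis has six columns, in line with the statement that K knots give K basis functions.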


5.2.2 Example: South African Heart Disease (Continued) 

In Section 4.4.2 we fit linear logistic regression models to the South African 
heart disease data. Here we explore nonlinearities in the functions using 
natural splines. The functional form of the model is 

$$\mathrm{logit}[\Pr(\texttt{chd} \mid X)] = \theta_0 + h_1(X_1)^T\theta_1 + h_2(X_2)^T\theta_2 + \cdots + h_p(X_p)^T\theta_p, \tag{5.6}$$

where each of the $\theta_j$ are vectors of coefficients multiplying their
associated vector of natural spline basis functions $h_j$.

We use four natural spline bases for each term in the model. For example,
with $X_1$ representing sbp, $h_1(X_1)$ is a basis consisting of four basis
functions. This actually implies three rather than two interior knots (chosen at
uniform quantiles of sbp), plus two boundary knots at the extremes of the
data, since we exclude the constant term from each of the $h_j$.

Since famhist is a two-level factor, it is coded by a simple binary or 
dummy variable, and is associated with a single coefficient in the fit of the 
model. 

More compactly we can combine all p vectors of basis functions (and
the constant term) into one big vector h(X), and then the model is simply
$h(X)^T\theta$, with total number of parameters $df = 1 + \sum_{j=1}^{p} df_j$,
the sum of the parameters in each component term. Each basis function is
evaluated at each of the N samples, resulting in an N × df basis matrix H. At
this point the model is like any other linear logistic model, and the algorithms
described in Section 4.4.1 apply.

We carried out a backward stepwise deletion process, dropping terms
from this model while preserving the group structure of each term, rather
than dropping one coefficient at a time. The AIC statistic (Section 7.5) was
used to drop terms, and all the terms remaining in the final model would
cause AIC to increase if deleted from the model (see Table 5.1). Figure 5.4
shows a plot of the final model selected by the stepwise regression. The
functions displayed are $\hat f_j(X_j) = h_j(X_j)^T\hat\theta_j$ for each
variable $X_j$. The covariance matrix $\mathrm{Cov}(\hat\theta) = \Sigma$ is
estimated by $\hat\Sigma = (\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}$, where
$\mathbf{W}$ is the diagonal weight matrix from the logistic regression. Hence
$v_j(X_j) = \mathrm{Var}[\hat f_j(X_j)] = h_j(X_j)^T\hat\Sigma_{jj}h_j(X_j)$ is
the pointwise variance function of $\hat f_j$, where
$\mathrm{Cov}(\hat\theta_j) = \hat\Sigma_{jj}$ is the appropriate sub-matrix of
$\hat\Sigma$. The shaded region in each panel is defined by
$\hat f_j(X_j) \pm 2\sqrt{v_j(X_j)}$.

The AIC statistic is slightly more generous than the likelihood-ratio test 
(deviance test). Both sbp and obesity are included in this model, while 

[Figure 5.4: one panel per term in the final model, including obesity and age.]

FIGURE 5.4. Fitted natural-spline functions for each of the terms in the final
model selected by the stepwise procedure. Included are pointwise standard-error
bands. The rug plot at the base of each figure indicates the location of each of the
sample values for that variable (jittered to break ties).

TABLE 5.1. Final logistic regression model, after stepwise deletion of natural
spline terms. The column labeled "LRT" is the likelihood-ratio test statistic when
that term is deleted from the model, and is the change in deviance from the full
model (labeled "none").

Terms      Df   Deviance      AIC       LRT   P-value
none              458.09   502.09
sbp         4     467.16   503.16     9.076     0.059
tobacco     4     470.48   506.48    12.387     0.015
ldl         4     472.39   508.39    14.307     0.006
famhist     1     479.44   521.44    21.356     0.000
obesity     4     466.24   502.24     8.147     0.086
age         4     481.86   517.86    23.768     0.000

they were not in the linear model. The figure explains why, since their 
contributions are inherently nonlinear. These effects at first may come as 
a surprise, but an explanation lies in the nature of the retrospective data. 
These measurements were made sometime after the patients suffered a 
heart attack, and in many cases they had already benefited from a healthier 
diet and lifestyle, hence the apparent increase in risk at low values for 
obesity and sbp. Table 5.1 shows a summary of the selected model. 
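
The entries of Table 5.1 can be cross-checked with simple arithmetic. The sketch below assumes the usual convention AIC = deviance + 2 × (number of parameters), which is not stated explicitly here but does reproduce the tabulated AIC values; the parameter count of 22 for the full model (an intercept, five spline terms with four coefficients each, and one coefficient for famhist) follows the description in the text.

```python
# Deviance and parameter counts taken from Table 5.1.
full_deviance = 458.09
full_df = 1 + 5 * 4 + 1          # intercept + five 4-df spline terms + famhist

dropped = {                      # term: (df of term, deviance after dropping it)
    "sbp": (4, 467.16), "tobacco": (4, 470.48), "ldl": (4, 472.39),
    "famhist": (1, 479.44), "obesity": (4, 466.24), "age": (4, 481.86),
}

def aic(deviance, n_params):
    # Assumed convention: AIC = deviance + 2 * number of parameters
    return deviance + 2 * n_params

full_aic = aic(full_deviance, full_df)
reduced_aic = {t: aic(dev, full_df - df) for t, (df, dev) in dropped.items()}
lrt = {t: dev - full_deviance for t, (df, dev) in dropped.items()}
```

Every entry of `reduced_aic` exceeds `full_aic`, which is the stopping condition described for the backward stepwise procedure, and `lrt` matches the LRT column up to the rounding of the deviances.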


5.2.3 Example: Phoneme Recognition 

In this example we use splines to reduce flexibility rather than increase it; 
the application comes under the general heading of functional modeling. In 
the top panel of Figure 5.5 are displayed a sample of 15 log-periodograms 
for each of the two phonemes “aa” and “ao” measured at 256 frequencies. 
The goal is to use such data to classify a spoken phoneme. These two 
phonemes were chosen because they are difficult to separate. 

The input feature is a vector x of length 256, which we can think of as 
a vector of evaluations of a function X(f) over a grid of frequencies f. In
reality there is a continuous analog signal which is a function of frequency, 
and we have a sampled version of it. 

The gray lines in the lower panel of Figure 5.5 show the coefficients of 
a linear logistic regression model fit by maximum likelihood to a training 
sample of 1000 drawn from the total of 695 “aa”s and 1022 “ao”s. The 
coefficients are also plotted as a function of frequency, and in fact we can 
think of the model in terms of its continuous counterpart 


$$\log\frac{\Pr(\texttt{aa} \mid X)}{\Pr(\texttt{ao} \mid X)} = \int X(f)\beta(f)\,df, \tag{5.7}$$

[Figure 5.5: top panel "Phoneme Examples", horizontal axis Frequency (0 to 250);
lower panel "Phoneme Classification: Raw and Restricted Logistic Regression".]

FIGURE 5.5. The top panel displays the log-periodogram as a function of
frequency for 15 examples each of the phonemes "aa" and "ao" sampled from a total
of 695 "aa"s and 1022 "ao"s. Each log-periodogram is measured at 256 uniformly
spaced frequencies. The lower panel shows the coefficients (as a function of
frequency) of a logistic regression fit to the data by maximum likelihood, using
the 256 log-periodogram values as inputs. The coefficients are restricted to be
smooth in the red curve, and are unrestricted in the jagged gray curve.

which we approximate by

$$\sum_{j=1}^{256} X(f_j)\beta(f_j) = \sum_{j=1}^{256} x_j\beta_j. \tag{5.8}$$

The coefficients compute a contrast functional, and will have appreciable 
values in regions of frequency where the log-periodograms differ between 
the two classes. 

The gray curves are very rough. Since the input signals have fairly strong
positive autocorrelation, this results in negative autocorrelation in the
coefficients. In addition the sample size effectively provides only four
observations per coefficient.

Applications such as this permit a natural regularization. We force the 
coefficients to vary smoothly as a function of frequency. The red curve in the 
lower panel of Figure 5.5 shows such a smooth coefficient curve fit to these 
data. We see that the lower frequencies offer the most discriminatory power. 
Not only does the smoothing allow easier interpretation of the contrast, it 
also produces a more accurate classifier: 



                  Raw     Regularized
Training error    0.080   0.185
Test error        0.255   0.158


The smooth red curve was obtained through a very simple use of natural
cubic splines. We can represent the coefficient function as an expansion of
splines $\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$. In practice this means that
$\beta = \mathbf{H}\theta$, where $\mathbf{H}$ is a p × M basis matrix of natural
cubic splines, defined on the set of frequencies. Here we used M = 12 basis
functions, with knots uniformly placed over the integers 1, 2, ..., 256
representing the frequencies. Since $x^T\beta = x^T\mathbf{H}\theta$, we can
simply replace the input features x by their filtered versions
$x^* = \mathbf{H}^T x$, and fit $\theta$ by linear logistic regression on the
$x^*$. The red curve is thus $\hat\beta(f) = h(f)^T\hat\theta$.
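
The filtering identity $x^T\beta = x^T\mathbf{H}\theta = (\mathbf{H}^T x)^T\theta$ holds for any basis matrix. The sketch below illustrates it with a stand-in smooth basis of Gaussian bumps, chosen only for brevity; the text uses natural cubic splines, and the inputs and coefficients here are simulated rather than taken from the phoneme data.

```python
import numpy as np

rng = np.random.default_rng(4)
p, M, n = 256, 12, 5

# Stand-in smooth basis: M Gaussian bumps over the frequency index (an
# assumption for brevity; the text uses natural cubic splines).
freq = np.arange(1, p + 1)
centers = np.linspace(1, p, M)
H = np.exp(-0.5 * ((freq[:, None] - centers[None, :]) / 20.0)**2)   # p x M

X = rng.normal(size=(n, p))      # n simulated inputs, one per row
theta = rng.normal(size=M)       # coefficients in the reduced space

beta = H @ theta                 # smooth coefficient curve, length p
X_star = X @ H                   # filtered features x* = H^T x, one row per sample
```

Fitting on `X_star` means estimating only M = 12 coefficients instead of 256, while the implied `beta` remains a smooth function of frequency.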


5.3 Filtering and Feature Extraction 

In the previous example, we constructed a p × M basis matrix $\mathbf{H}$, and
then transformed our features x into new features $x^* = \mathbf{H}^T x$. These
filtered versions of the features were then used as inputs into a learning
procedure: in the previous example, this was linear logistic regression.

Preprocessing of high-dimensional features is a very general and
powerful method for improving the performance of a learning algorithm. The
preprocessing need not be linear as it was above, but can be a general
(nonlinear) function of the form $x^* = g(x)$. The derived features $x^*$ can
then be used as inputs into any (linear or nonlinear) learning procedure.

For example, for signal or image recognition a popular approach is to first 
transform the raw features via a wavelet transform $x^* = \mathbf{H}^T x$ (Section 5.9)
and then use the features x* as inputs into a neural network (Chapter 11). 
Wavelets are effective in capturing discrete jumps or edges, and the neural 
network is a powerful tool for constructing nonlinear functions of these 
features for predicting the target variable. By using domain knowledge 
to construct appropriate features, one can often improve upon a learning 
method that has only the raw features x at its disposal. 


5.4 Smoothing Splines 

Here we discuss a spline basis method that avoids the knot selection
problem completely by using a maximal set of knots. The complexity of the fit
is controlled by regularization. Consider the following problem: among all
functions f(x) with two continuous derivatives, find one that minimizes the
penalized residual sum of squares

$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2\,dt, \tag{5.9}$$

where $\lambda$ is a fixed smoothing parameter. The first term measures closeness
to the data, while the second term penalizes curvature in the function, and
$\lambda$ establishes a tradeoff between the two. Two special cases are:

$\lambda = 0$ : f can be any function that interpolates the data.

$\lambda = \infty$ : the simple least squares line fit, since no second
derivative can be tolerated.

These vary from very rough to very smooth, and the hope is that
$\lambda \in (0, \infty)$ indexes an interesting class of functions in between.

The criterion (5.9) is defined on an infinite-dimensional function space;
in fact, a Sobolev space of functions for which the second term is defined.
Remarkably, it can be shown that (5.9) has an explicit, finite-dimensional,
unique minimizer which is a natural cubic spline with knots at the unique
values of the $x_i$, $i = 1, \ldots, N$ (Exercise 5.7). At face value it seems
that the family is still over-parametrized, since there are as many as N knots,
which implies N degrees of freedom. However, the penalty term translates
to a penalty on the spline coefficients, which are shrunk some of the way
toward the linear fit.

Since the solution is a natural spline, we can write it as

$$f(x) = \sum_{j=1}^{N} N_j(x)\theta_j, \tag{5.10}$$

[Figure 5.6: horizontal axis: Age.]

FIGURE 5.6. The response is the relative change in bone mineral density
measured at the spine in adolescents, as a function of age. A separate smoothing
spline was fit to the males and females, with $\lambda \approx 0.00022$. This
choice corresponds to about 12 degrees of freedom.


where the $N_j(x)$ are an N-dimensional set of basis functions for
representing this family of natural splines (Section 5.2.1 and Exercise 5.4).
The criterion thus reduces to

$$\mathrm{RSS}(\theta, \lambda) = (\mathbf{y} - \mathbf{N}\theta)^T(\mathbf{y} - \mathbf{N}\theta) + \lambda\theta^T\Omega_N\theta, \tag{5.11}$$

where $\{\mathbf{N}\}_{ij} = N_j(x_i)$ and
$\{\Omega_N\}_{jk} = \int N_j''(t)N_k''(t)\,dt$. The solution is easily seen to be

$$\hat\theta = (\mathbf{N}^T\mathbf{N} + \lambda\Omega_N)^{-1}\mathbf{N}^T\mathbf{y}, \tag{5.12}$$

a generalized ridge regression. The fitted smoothing spline is given by

$$\hat f(x) = \sum_{j=1}^{N} N_j(x)\hat\theta_j. \tag{5.13}$$


Efficient computational techniques for smoothing splines are discussed in 
the Appendix to this chapter. 
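
The generalized ridge solution has the same form for any basis and any curvature penalty matrix. The sketch below is a deliberately small illustration, not the efficient method referred to above: it assumes a four-column cubic polynomial basis on [0, 1] in place of the full N-dimensional natural spline basis, so that the penalty matrix of integrated second-derivative products can be written down by hand. The function name `generalized_ridge` is hypothetical.

```python
import numpy as np

def generalized_ridge(N, Omega, y, lam):
    """theta_hat = (N^T N + lam * Omega)^{-1} N^T y for a given basis and penalty."""
    return np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)

# Assumed basis 1, x, x^2, x^3; second derivatives are 0, 0, 2, 6t, so the
# penalty entries are integrals over [0, 1] of their pairwise products.
N = np.column_stack([np.ones_like(x), x, x**2, x**3])
Omega = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 4, 6],
                  [0, 0, 6, 12]], dtype=float)

fit_rough = N @ generalized_ridge(N, Omega, y, lam=0.0)    # no penalty
fit_smooth = N @ generalized_ridge(N, Omega, y, lam=1e8)   # heavy penalty
line = np.polyval(np.polyfit(x, y, 1), x)                  # least squares line
```

With no penalty the fit is the unpenalized cubic, and with a very large penalty it approaches the least squares line, mirroring the two limiting cases of the smoothing parameter.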

Figure 5.6 shows a smoothing spline fit to some data on bone mineral 
density (BMD) in adolescents. The response is relative change in spinal 
BMD over two consecutive visits, typically about one year apart. The data 
are color coded by gender, and two separate curves were fit. This simple 
summary reinforces the evidence in the data that the growth spurt for
females precedes that for males by about two years. In both cases the
smoothing parameter $\lambda$ was approximately 0.00022; this choice is discussed
in the next section.

5.4.1 Degrees of Freedom and Smoother Matrices 

We have not yet indicated how $\lambda$ is chosen for the smoothing spline. Later
in this chapter we describe automatic methods using techniques such as
cross-validation. In this section we discuss intuitive ways of prespecifying
the amount of smoothing.

A smoothing spline with prechosen $\lambda$ is an example of a linear smoother
(as in linear operator). This is because the estimated parameters in (5.12)
are a linear combination of the $y_i$. Denote by $\hat{\mathbf{f}}$ the N-vector
of fitted values $\hat f(x_i)$ at the training predictors $x_i$. Then

$$\hat{\mathbf{f}} = \mathbf{N}(\mathbf{N}^T\mathbf{N} + \lambda\Omega_N)^{-1}\mathbf{N}^T\mathbf{y} = \mathbf{S}_\lambda\mathbf{y}. \tag{5.14}$$

Again the fit is linear in $\mathbf{y}$, and the finite linear operator
$\mathbf{S}_\lambda$ is known as the smoother matrix. One consequence of this
linearity is that the recipe for producing $\hat{\mathbf{f}}$ from $\mathbf{y}$
does not depend on $\mathbf{y}$ itself; $\mathbf{S}_\lambda$ depends only on
the $x_i$ and $\lambda$.

Linear operators are familiar in more traditional least squares fitting as
well. Suppose $\mathbf{B}_\xi$ is an N × M matrix of M cubic-spline basis
functions evaluated at the N training points $x_i$, with knot sequence $\xi$,
and $M \ll N$. Then the vector of fitted spline values is given by

$$\hat{\mathbf{f}} = \mathbf{B}_\xi(\mathbf{B}_\xi^T\mathbf{B}_\xi)^{-1}\mathbf{B}_\xi^T\mathbf{y} = \mathbf{H}_\xi\mathbf{y}. \tag{5.15}$$

Here the linear operator $\mathbf{H}_\xi$ is a projection operator, also known
as the hat matrix in statistics. There are some important similarities and
differences between $\mathbf{H}_\xi$ and $\mathbf{S}_\lambda$:

• Both are symmetric, positive semidefinite matrices.

• $\mathbf{H}_\xi\mathbf{H}_\xi = \mathbf{H}_\xi$ (idempotent), while
$\mathbf{S}_\lambda\mathbf{S}_\lambda \preceq \mathbf{S}_\lambda$, meaning that
the right-hand side exceeds the left-hand side by a positive semidefinite matrix.
This is a consequence of the shrinking nature of $\mathbf{S}_\lambda$, which we
discuss further below.

• $\mathbf{H}_\xi$ has rank M, while $\mathbf{S}_\lambda$ has rank N.

The expression $M = \mathrm{trace}(\mathbf{H}_\xi)$ gives the dimension of the
projection space, which is also the number of basis functions, and hence the
number of parameters involved in the fit. By analogy we define the effective
degrees of freedom of a smoothing spline to be

$$\mathrm{df}_\lambda = \mathrm{trace}(\mathbf{S}_\lambda), \tag{5.16}$$

the sum of the diagonal elements of $\mathbf{S}_\lambda$. This very useful
definition allows us a more intuitive way to parameterize the smoothing spline,
and indeed many other smoothers as well, in a consistent fashion. For example, in
Figure 5.6 we specified $\mathrm{df}_\lambda = 12$ for each of the curves, and
the corresponding $\lambda \approx 0.00022$ was derived numerically by solving
$\mathrm{trace}(\mathbf{S}_\lambda) = 12$. There are many arguments supporting
this definition of degrees of freedom, and we cover some of them here.
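
The contrast between a projection and a shrinking smoother can be seen on a small example. The sketch below uses a discrete stand-in for the smoothing spline: the basis is the identity and the penalty is the sum of squared second differences on an equally spaced grid. This is an assumption made for brevity and is not the exact smoothing-spline penalty, but it has the same form as (5.14) and shares the properties listed above.

```python
import numpy as np

n, lam = 40, 5.0
x = np.linspace(0.0, 1.0, n)

D = np.diff(np.eye(n), n=2, axis=0)            # (n-2) x n second-difference operator
S = np.linalg.inv(np.eye(n) + lam * D.T @ D)   # shrinking smoother (stand-in)

B = np.column_stack([x**j for j in range(4)])  # cubic polynomial basis, M = 4
H = B @ np.linalg.solve(B.T @ B, B.T)          # projection (hat) matrix

df_S, df_H = np.trace(S), np.trace(H)          # effective degrees of freedom
gap_eigs = np.linalg.eigvalsh(S - S @ S)       # nonnegative when S S is below S
```

Here `df_H` equals the number of basis functions, while `df_S` is a non-integer value between 2 and n that moves with `lam`.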

Since $\mathbf{S}_\lambda$ is symmetric (and positive semidefinite), it has a
real eigen-decomposition. Before we proceed, it is convenient to rewrite
$\mathbf{S}_\lambda$ in the Reinsch form

$$\mathbf{S}_\lambda = (\mathbf{I} + \lambda\mathbf{K})^{-1}, \tag{5.17}$$

where $\mathbf{K}$ does not depend on $\lambda$ (Exercise 5.9). Since
$\hat{\mathbf{f}} = \mathbf{S}_\lambda\mathbf{y}$ solves

$$\min_{\mathbf{f}}\,(\mathbf{y} - \mathbf{f})^T(\mathbf{y} - \mathbf{f}) + \lambda\mathbf{f}^T\mathbf{K}\mathbf{f}, \tag{5.18}$$

$\mathbf{K}$ is known as the penalty matrix, and indeed a quadratic form in
$\mathbf{K}$ has a representation in terms of a weighted sum of squared
(divided) second differences. The eigen-decomposition of $\mathbf{S}_\lambda$ is

$$\mathbf{S}_\lambda = \sum_{k=1}^{N} \rho_k(\lambda)\mathbf{u}_k\mathbf{u}_k^T \tag{5.19}$$

with

$$\rho_k(\lambda) = \frac{1}{1 + \lambda d_k}, \tag{5.20}$$

and $d_k$ the corresponding eigenvalue of $\mathbf{K}$. Figure 5.7 (top) shows
the results of applying a cubic smoothing spline to some air pollution data (128
observations). Two fits are given: a smoother fit corresponding to a larger
penalty $\lambda$ and a rougher fit for a smaller penalty. The lower panels
represent the eigenvalues (lower left) and some eigenvectors (lower right) of the
corresponding smoother matrices. Some of the highlights of the
eigenrepresentation are the following:


• The eigenvectors are not affected by changes in $\lambda$, and hence the whole
family of smoothing splines (for a particular sequence x) indexed by
$\lambda$ have the same eigenvectors.

• $\mathbf{S}_\lambda\mathbf{y} = \sum_{k=1}^{N} \mathbf{u}_k\rho_k(\lambda)\langle\mathbf{u}_k, \mathbf{y}\rangle$,
and hence the smoothing spline operates by decomposing $\mathbf{y}$ w.r.t. the
(complete) basis $\{\mathbf{u}_k\}$, and differentially shrinking the
contributions using $\rho_k(\lambda)$. This is to be contrasted with a
basis-regression method, where the components are

[Figure 5.7: top panel, horizontal axis: Daggot Pressure Gradient.]

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot
pressure gradient. The two fits correspond to different values of the smoothing
parameter, chosen to achieve five and eleven effective degrees of freedom, defined
by $\mathrm{df}_\lambda = \mathrm{trace}(\mathbf{S}_\lambda)$. (Lower left:) First
25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1,
and all are $\ge 0$. (Lower right:) Third to sixth eigenvectors of the spline
smoother matrices. In each case, $\mathbf{u}_k$ is plotted against x, and as such
is viewed as a function of x. The rug at the base of the plots indicates the
occurrence of data points. The damped functions represent the smoothed versions
of these functions (using the 5 df smoother).

either left alone, or shrunk to zero; that is, a projection matrix such
as $\mathbf{H}_\xi$ above has M eigenvalues equal to 1, and the rest are 0. For
this reason smoothing splines are referred to as shrinking smoothers,
while regression splines are projection smoothers (see Figure 3.17 on
page 80).

• The sequence of $\mathbf{u}_k$, ordered by decreasing $\rho_k(\lambda)$, appear
to increase in complexity. Indeed, they have the zero-crossing behavior of
polynomials of increasing degree. Since
$\mathbf{S}_\lambda\mathbf{u}_k = \rho_k(\lambda)\mathbf{u}_k$, we see how each of
the eigenvectors themselves are shrunk by the smoothing spline: the
higher the complexity, the more they are shrunk. If the domain of X
is periodic, then the $\mathbf{u}_k$ are sines and cosines at different
frequencies.

• The first two eigenvalues are always one, and they correspond to the
two-dimensional eigenspace of functions linear in x (Exercise 5.11),
which are never shrunk.

• The eigenvalues $\rho_k(\lambda) = 1/(1 + \lambda d_k)$ are an inverse function
of the eigenvalues $d_k$ of the penalty matrix $\mathbf{K}$, moderated by
$\lambda$; $\lambda$ controls the rate at which the $\rho_k(\lambda)$ decrease to
zero. $d_1 = d_2 = 0$ and again linear functions are not penalized.

• One can reparametrize the smoothing spline using the basis vectors
$\mathbf{u}_k$ (the Demmler-Reinsch basis). In this case the smoothing spline
solves

$$\min_{\theta}\,\|\mathbf{y} - \mathbf{U}\theta\|^2 + \lambda\theta^T\mathbf{D}\theta, \tag{5.21}$$

where $\mathbf{U}$ has columns $\mathbf{u}_k$ and $\mathbf{D}$ is a diagonal
matrix with elements $d_k$.

• $\mathrm{df}_\lambda = \mathrm{trace}(\mathbf{S}_\lambda) = \sum_{k=1}^{N} \rho_k(\lambda)$.
For projection smoothers, all the eigenvalues are 1, each one corresponding to a
dimension of the projection subspace.
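
Several of the points in this list can be verified numerically. The sketch below again assumes a second-difference penalty on an equally spaced grid as a stand-in for the smoothing-spline penalty matrix; the eigenvalue relationships it checks depend only on the Reinsch form, not on that particular choice.

```python
import numpy as np

n = 30
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D                                   # stand-in penalty matrix
d, U = np.linalg.eigh(K)                      # d ascending; columns of U are the u_k

def smoother(lam):
    return np.linalg.inv(np.eye(n) + lam * K)

def rho(lam):
    return 1.0 / (1.0 + lam * d)              # eigenvalues of the smoother

lam = 3.0
S = smoother(lam)
S_rebuilt = (U * rho(lam)) @ U.T              # sum over k of rho_k u_k u_k^T
df = rho(lam).sum()                           # equals trace(S)
linear = np.arange(n, dtype=float)            # a function linear in the index
```

The eigenvectors `U` are computed once from `K` and reused for every value of `lam`; only `rho(lam)` changes, which is the first point in the list above.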

Figure 5.8 depicts a smoothing spline matrix, with the rows ordered with
x. The banded nature of this representation suggests that a smoothing
spline is a local fitting method, much like the locally weighted regression
procedures in Chapter 6. The right panel shows in detail selected rows of
$\mathbf{S}$, which we call the equivalent kernels. As $\lambda \to 0$,
$\mathrm{df}_\lambda \to N$, and $\mathbf{S}_\lambda \to \mathbf{I}$, the
N-dimensional identity matrix. As $\lambda \to \infty$,
$\mathrm{df}_\lambda \to 2$, and $\mathbf{S}_\lambda \to \mathbf{H}$, the
hat matrix for linear regression on x.


5.5 Automatic Selection of the Smoothing Parameters


The smoothing parameters for regression splines encompass the degree of 
the splines, and the number and placement of the knots. For smoothing 

[Figure 5.8: left panel "Smoother Matrix"; right panels "Equivalent Kernels" for
rows 12, 25, 50, 75, 100 and 115.]

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded,
indicating an equivalent kernel with local support. The left panel represents the
elements of $\mathbf{S}$ as an image. The right panel shows the equivalent kernel
or weighting function in detail for the indicated rows.

splines, we have only the penalty parameter $\lambda$ to select, since the knots
are at all the unique training X's, and cubic degree is almost always used in
practice.

Selecting the placement and number of knots for regression splines can be 
a combinatorially complex task, unless some simplifications are enforced. 
The MARS procedure in Chapter 9 uses a greedy algorithm with some 
additional approximations to achieve a practical compromise. We will not 
discuss this further here. 

5.5.1 Fixing the Degrees of Freedom 

Since $\mathrm{df}_\lambda = \mathrm{trace}(\mathbf{S}_\lambda)$ is monotone in
$\lambda$ for smoothing splines, we can invert the relationship and specify
$\lambda$ by fixing df. In practice this can be achieved by simple numerical
methods. So, for example, in R one can use smooth.spline(x,y,df=6) to specify
the amount of smoothing. This encourages a more traditional mode of model
selection, where we might try a couple of different values of df, and select one
based on approximate F-tests, residual plots and other more subjective criteria.
Using df in this way provides a uniform approach to compare many different
smoothing methods. It is particularly useful in generalized additive models
(Chapter 9), where several smoothing methods can be simultaneously used in one
model.
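
One simple numerical method for inverting the relationship is bisection on the logarithm of the penalty. The sketch below is an illustration under assumptions: it uses the eigenvalues of a second-difference penalty as a stand-in for those of the smoothing-spline penalty matrix, and it is not a description of how smooth.spline works internally. The function names are hypothetical.

```python
import numpy as np

def df_of_lambda(lam, d):
    """Effective degrees of freedom: sum of 1 / (1 + lam * d_k)."""
    return np.sum(1.0 / (1.0 + lam * d))

def lambda_for_df(target_df, d, lo=1e-10, hi=1e10, iters=200):
    """Bisection on log(lambda); df is decreasing in lambda."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                # midpoint on the log scale
        if df_of_lambda(mid, d) > target_df:
            lo = mid                          # too rough: increase the penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)

n = 50
D = np.diff(np.eye(n), n=2, axis=0)
d = np.clip(np.linalg.eigvalsh(D.T @ D), 0.0, None)   # penalty eigenvalues, rounding clipped
lam6 = lambda_for_df(6.0, d)
```

Any target between 2 and n can be reached this way, since the degrees of freedom fall monotonically from n to 2 as the penalty grows.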


5.5.2 The Bias-Variance Tradeoff 

Figure 5.9 shows the effect of the choice of $\mathrm{df}_\lambda$ when using
a smoothing spline on a simple example:

$$Y = f(X) + \varepsilon, \qquad f(X) = \frac{\sin(12(X + 0.2))}{X + 0.2}, \tag{5.22}$$

with $X \sim U[0,1]$ and $\varepsilon \sim N(0,1)$. Our training sample
consists of $N = 100$ pairs $x_i, y_i$ drawn independently from this model.

FIGURE 5.9. The top left panel ("Cross-Validation") shows the EPE($\lambda$)
and CV($\lambda$) curves for a realization from a nonlinear additive error
model (5.22). The remaining panels ($\mathrm{df}_\lambda = 5$,
$\mathrm{df}_\lambda = 9$, $\mathrm{df}_\lambda = 15$) show the data, the true
functions (in purple), and the fitted curves (in green) with yellow shaded
$\pm 2\times$ standard error bands, for three different values of
$\mathrm{df}_\lambda$.

The fitted splines for three different values of $\mathrm{df}_\lambda$ are
shown. The yellow shaded region in the figure represents the pointwise
standard error of $\hat f_\lambda$, that is, we have shaded the region between
$\hat f_\lambda(x) \pm 2 \cdot \mathrm{se}(\hat f_\lambda(x))$. Since
$\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}$,

$$\mathrm{Cov}(\hat{\mathbf{f}}) = \mathbf{S}_\lambda \mathrm{Cov}(\mathbf{y}) \mathbf{S}_\lambda^T = \mathbf{S}_\lambda \mathbf{S}_\lambda^T. \tag{5.23}$$

The diagonal contains the pointwise variances at the training $x_i$. The bias
is given by

$$\mathrm{Bias}(\hat{\mathbf{f}}) = \mathbf{f} - \mathrm{E}(\hat{\mathbf{f}}) = \mathbf{f} - \mathbf{S}_\lambda \mathbf{f}, \tag{5.24}$$

where $\mathbf{f}$ is the (unknown) vector of evaluations of the true $f$ at
the training $X$'s. The expectations and variances are with respect to
repeated draws of samples of size $N = 100$ from the model (5.22). In a
similar fashion $\mathrm{Var}(\hat f_\lambda(x_0))$ and
$\mathrm{Bias}(\hat f_\lambda(x_0))$ can be computed at any point $x_0$
(Exercise 5.10). The three fits displayed in the figure give a visual
demonstration of the bias-variance tradeoff associated with selecting the
smoothing parameter.


$\mathrm{df}_\lambda = 5$: The spline underfits, and clearly trims down the
hills and fills in the valleys. This leads to a bias that is most dramatic
in regions of high curvature. The standard error band is very narrow, so
we estimate a badly biased version of the true function with great
reliability!

$\mathrm{df}_\lambda = 9$: Here the fitted function is close to the true
function, although a slight amount of bias seems evident. The variance has
not increased appreciably.

$\mathrm{df}_\lambda = 15$: The fitted function is somewhat wiggly, but close
to the true function. The wiggliness also accounts for the increased width
of the standard error bands: the curve is starting to follow some
individual points too closely.


Note that in these figures we are seeing a single realization of data and
hence fitted spline $\hat f$ in each case, while the bias involves an
expectation $\mathrm{E}(\hat f)$. We leave it as an exercise (5.10) to compute
similar figures where the bias is shown as well. The middle curve seems
“just right,” in that it has achieved a good compromise between bias and
variance.
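The formulas (5.23) and (5.24) apply to any linear smoother, so the tradeoff can be sketched numerically without fitting a spline. The Python example below is a minimal sketch under stated assumptions: it swaps in a moving-average smoother as a stand-in for $\mathbf{S}_\lambda$ (this is not a smoothing spline), uses a fixed evenly spaced design rather than uniform random draws, and takes $\mathrm{Cov}(\mathbf{y}) = \mathbf{I}$ as in model (5.22). All function names are hypothetical.

```python
import math

def moving_average_smoother(n, half_width):
    """Rows of a simple linear smoother S: each fitted value is a local average of y."""
    S = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n - 1, i + half_width)
        w = 1.0 / (hi - lo + 1)
        S.append([w if lo <= j <= hi else 0.0 for j in range(n)])
    return S

def true_f(x):
    """The regression function of (5.22)."""
    return math.sin(12 * (x + 0.2)) / (x + 0.2)

def pointwise_bias_and_variance(S, f):
    # Var(f_hat_i) = sum_j S_ij^2, the i-th diagonal entry of S S^T when Cov(y) = I.
    var = [sum(s * s for s in row) for row in S]
    # Bias_i = f_i - (S f)_i, as in (5.24).
    bias = [fi - sum(s * fj for s, fj in zip(row, f)) for row, fi in zip(S, f)]
    return bias, var

n = 100
x = [(i + 0.5) / n for i in range(n)]   # evenly spaced design (an assumption)
f = [true_f(xi) for xi in x]

results = {}
for h in (1, 5, 15):                    # wider window = smoother fit = fewer df
    bias, var = pointwise_bias_and_variance(moving_average_smoother(n, h), f)
    results[h] = (sum(b * b for b in bias) / n, sum(var) / n)
    print(h, results[h])
```

Widening the averaging window plays the role of lowering $\mathrm{df}_\lambda$: the average squared bias grows while the average variance shrinks.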

The integrated squared prediction error (EPE) combines both bias and
variance in a single summary:

$$\begin{aligned}
\mathrm{EPE}(\hat f_\lambda) &= \mathrm{E}(Y - \hat f_\lambda(X))^2 \\
&= \mathrm{Var}(Y) + \mathrm{E}\left[\mathrm{Bias}^2(\hat f_\lambda(X)) + \mathrm{Var}(\hat f_\lambda(X))\right] \\
&= \sigma^2 + \mathrm{MSE}(\hat f_\lambda).
\end{aligned} \tag{5.25}$$

Note that this is averaged both over the training sample (giving rise to
$\hat f_\lambda$), and the values of the (independently chosen) prediction
points $(X, Y)$. EPE is a natural quantity of interest, and does create a
tradeoff between bias and variance. The blue points in the top left panel of
Figure 5.9 suggest that $\mathrm{df}_\lambda = 9$ is spot on!

Since we don’t know the true function, we do not have access to EPE, and
need an estimate. This topic is discussed in some detail in Chapter 7, and
techniques such as K-fold cross-validation, GCV and $C_p$ are all in common
use. In Figure 5.9 we include the N-fold (leave-one-out) cross-validation
curve:

$$\mathrm{CV}(\hat f_\lambda) = \frac{1}{N}\sum_{i=1}^N \left(y_i - \hat f_\lambda^{(-i)}(x_i)\right)^2 \tag{5.26}$$

$$\phantom{\mathrm{CV}(\hat f_\lambda)} = \frac{1}{N}\sum_{i=1}^N \left(\frac{y_i - \hat f_\lambda(x_i)}{1 - S_\lambda(i,i)}\right)^2, \tag{5.27}$$

which can (remarkably) be computed for each value of $\lambda$ from the
original fitted values and the diagonal elements of $\mathbf{S}_\lambda$
(Exercise 5.13).

The EPE and CV curves have a similar shape, but the entire CV curve 
is above the EPE curve. For some realizations this is reversed, and overall 
the CV curve is approximately unbiased as an estimate of the EPE curve. 
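The equality of (5.26) and (5.27) can be checked numerically on a much simpler linear smoother. The sketch below is an illustration under assumptions, not the smoothing-spline computation: it uses a one-parameter ridge fit through the origin, for which $\hat y_i = x_i \hat\beta$ and $S(i,i) = x_i^2 / (\sum_j x_j^2 + \lambda)$, and compares brute-force leave-one-out refitting with the shortcut. The data and function names are made up.

```python
def ridge_fit(x, y, lam):
    """One-parameter ridge fit through the origin: beta = sum(x*y) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def loo_cv_direct(x, y, lam):
    """Form (5.26): refit N times, each time leaving one observation out."""
    total = 0.0
    for i in range(len(x)):
        beta = ridge_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:], lam)
        total += (y[i] - beta * x[i]) ** 2
    return total / len(x)

def loo_cv_shortcut(x, y, lam):
    """Form (5.27): a single fit plus the diagonal of the smoother matrix."""
    beta = ridge_fit(x, y, lam)
    denom = sum(a * a for a in x) + lam
    total = 0.0
    for xi, yi in zip(x, y):
        s_ii = xi * xi / denom                      # i-th diagonal element of S
        total += ((yi - beta * xi) / (1.0 - s_ii)) ** 2
    return total / len(x)

x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [0.4, 1.3, 1.2, 2.3, 2.2, 3.4]
cv_direct = loo_cv_direct(x, y, 0.7)
cv_shortcut = loo_cv_shortcut(x, y, 0.7)
print(cv_direct, cv_shortcut)
```

The two numbers agree up to rounding, which is the point of the shortcut: one fit replaces N refits.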


5.6 Nonparametric Logistic Regression 


The smoothing spline problem (5.9) in Section 5.4 is posed in a regression 
setting. It is typically straightforward to transfer this technology to other 
domains. Here we consider logistic regression with a single quantitative 
input X. The model is 


$$\log \frac{\Pr(Y = 1|X = x)}{\Pr(Y = 0|X = x)} = f(x), \tag{5.28}$$

which implies

$$\Pr(Y = 1|X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}. \tag{5.29}$$

Fitting $f(x)$ in a smooth fashion leads to a smooth estimate of the
conditional probability $\Pr(Y = 1|x)$, which can be used for classification
or risk scoring.

We construct the penalized log-likelihood criterion

$$\begin{aligned}
\ell(f;\lambda) &= \sum_{i=1}^N \left[y_i \log p(x_i) + (1 - y_i)\log(1 - p(x_i))\right] - \frac{1}{2}\lambda \int \{f''(t)\}^2\,dt \\
&= \sum_{i=1}^N \left[y_i f(x_i) - \log(1 + e^{f(x_i)})\right] - \frac{1}{2}\lambda \int \{f''(t)\}^2\,dt,
\end{aligned} \tag{5.30}$$

where we have abbreviated $p(x) = \Pr(Y = 1|x)$. The first term in this
expression is the log-likelihood based on the binomial distribution (cf.
Chapter 4, page 120). Arguments similar to those used in Section 5.4 show that
the optimal $f$ is a finite-dimensional natural spline with knots at the
unique values of $x$. This means that we can represent
$f(x) = \sum_{j=1}^N N_j(x)\theta_j$. We compute the first and second
derivatives

$$\frac{\partial \ell(\theta)}{\partial \theta} = \mathbf{N}^T(\mathbf{y} - \mathbf{p}) - \lambda \mathbf{\Omega}\theta, \tag{5.31}$$

$$\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta^T} = -\mathbf{N}^T \mathbf{W} \mathbf{N} - \lambda \mathbf{\Omega}, \tag{5.32}$$

where $\mathbf{p}$ is the N-vector with elements $p(x_i)$, and $\mathbf{W}$
is a diagonal matrix of weights $p(x_i)(1 - p(x_i))$. The first derivative
(5.31) is nonlinear in $\theta$, so we need to use an iterative algorithm as
in Section 4.4.1. Using Newton-Raphson as in (4.23) and (4.26) for linear
logistic regression, the update equation can be written

$$\begin{aligned}
\theta^{\mathrm{new}} &= (\mathbf{N}^T \mathbf{W} \mathbf{N} + \lambda \mathbf{\Omega})^{-1} \mathbf{N}^T \mathbf{W} \left(\mathbf{N}\theta^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})\right) \\
&= (\mathbf{N}^T \mathbf{W} \mathbf{N} + \lambda \mathbf{\Omega})^{-1} \mathbf{N}^T \mathbf{W} \mathbf{z}.
\end{aligned} \tag{5.33}$$

We can also express this update in terms of the fitted values

$$\begin{aligned}
\mathbf{f}^{\mathrm{new}} &= \mathbf{N}(\mathbf{N}^T \mathbf{W} \mathbf{N} + \lambda \mathbf{\Omega})^{-1} \mathbf{N}^T \mathbf{W} \left(\mathbf{f}^{\mathrm{old}} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})\right) \\
&= \mathbf{S}_{\lambda,w}\, \mathbf{z}.
\end{aligned} \tag{5.34}$$

Referring back to (5.12) and (5.14), we see that the update fits a weighted
smoothing spline to the working response $\mathbf{z}$ (Exercise 5.12).

The form of (5.34) is suggestive. It is tempting to replace
$\mathbf{S}_{\lambda,w}$ by any nonparametric (weighted) regression operator,
and obtain general families of nonparametric logistic regression models.
Although here $x$ is one-dimensional, this procedure generalizes naturally to
higher-dimensional $x$. These extensions are at the heart of generalized
additive models, which we pursue in Chapter 9.
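The update (5.33) can be sketched in a few lines once the basis and penalty are small enough to solve by hand. The Python example below is a minimal sketch under stated assumptions: the basis is the two-column $\mathbf{N}_i = (1, x_i)$ and the penalty matrix is $\mathbf{\Omega} = \mathrm{diag}(0, 1)$, both illustrative stand-ins for a natural-spline basis and its roughness penalty, and the data are made up. It iterates the weighted penalized least squares step on the working response and then checks that the penalized score (5.31) is essentially zero.

```python
import math

def irls_penalized_logistic(x, y, lam, n_iter=25):
    """Newton-Raphson / IRLS updates of the form (5.33) for a two-column basis.

    Basis: N_i = (1, x_i). Penalty: Omega = diag(0, 1), so only the slope is shrunk.
    """
    t0, t1 = 0.0, 0.0
    for _ in range(n_iter):
        a00 = a01 = a11 = b0 = b1 = 0.0
        for xi, yi in zip(x, y):
            eta = t0 + t1 * xi
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            z = eta + (yi - p) / w              # working response
            a00 += w                            # accumulate N^T W N
            a01 += w * xi
            a11 += w * xi * xi
            b0 += w * z                         # accumulate N^T W z
            b1 += w * xi * z
        a11 += lam                              # add lam * Omega
        det = a00 * a11 - a01 * a01
        t0, t1 = (a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det
    return t0, t1

def penalized_score(x, y, lam, t0, t1):
    """First derivative (5.31): N^T (y - p) - lam * Omega * theta."""
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(t0 + t1 * xi)))
        g0 += yi - p
        g1 += xi * (yi - p)
    return g0, g1 - lam * t1

x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1]
theta = irls_penalized_logistic(x, y, lam=1.0)
score = penalized_score(x, y, 1.0, *theta)
print(theta, score)
```

With a genuine spline basis the 2 x 2 solve would be replaced by a general linear solver, but the structure of each iteration would be the same.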


5.7 Multidimensional Splines 

So far we have focused on one-dimensional spline models. Each of the
approaches has multidimensional analogs. Suppose $X \in \mathbb{R}^2$, and we
have a basis of functions $h_{1k}(X_1)$, $k = 1, \ldots, M_1$ for representing
functions of coordinate $X_1$, and likewise a set of $M_2$ functions
$h_{2k}(X_2)$ for coordinate $X_2$. Then the $M_1 \times M_2$ dimensional
tensor product basis defined by

$$g_{jk}(X) = h_{1j}(X_1)\, h_{2k}(X_2), \quad j = 1, \ldots, M_1, \; k = 1, \ldots, M_2 \tag{5.35}$$

can be used for representing a two-dimensional function:

$$g(X) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \theta_{jk}\, g_{jk}(X). \tag{5.36}$$

FIGURE 5.10. A tensor product basis of B-splines, showing some selected pairs.
Each two-dimensional function is the tensor product of the corresponding
one-dimensional marginals.


Figure 5.10 illustrates a tensor product basis using B-splines. The
coefficients can be fit by least squares, as before. This can be generalized
to $d$ dimensions, but note that the dimension of the basis grows
exponentially fast, yet another manifestation of the curse of dimensionality.
The MARS procedure discussed in Chapter 9 is a greedy forward algorithm for
including only those tensor products that are deemed necessary by least
squares.
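The construction (5.35) is mechanical, and a small sketch makes the exponential growth concrete. The Python example below is illustrative only: the one-dimensional bases are simple polynomials and a hinge function rather than B-splines, and the function name `tensor_product_basis` is hypothetical.

```python
def tensor_product_basis(h1, h2):
    """Return the M1*M2 functions g_jk(x1, x2) = h1[j](x1) * h2[k](x2) of (5.35)."""
    return [
        (lambda x1, x2, f=f, g=g: f(x1) * g(x2))
        for f in h1
        for g in h2
    ]

# Illustrative one-dimensional bases (polynomials and a hinge, not B-splines).
h1 = [lambda t: 1.0, lambda t: t, lambda t: t * t]
h2 = [lambda t: 1.0, lambda t: t, lambda t: max(t - 0.5, 0.0), lambda t: t ** 3]

basis = tensor_product_basis(h1, h2)
row = [g(0.2, 0.8) for g in basis]     # one row of the 12-column design matrix
print(len(basis), row)

# With M functions per coordinate, a d-dimensional tensor basis has M**d members.
sizes = {d: 4 ** d for d in (1, 2, 3, 5, 10)}
print(sizes)
```

Stacking one such row per observation gives the design matrix to which least squares is applied.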

Figure 5.11 illustrates the difference between additive and tensor product
(natural) splines on the simulated classification example from Chapter 2.
A logistic regression model $\mathrm{logit}[\Pr(T|x)] = h(x)^T\theta$ is fit
to the binary response, and the estimated decision boundary is the contour
$h(x)^T\hat\theta = 0$. The tensor product basis can achieve more flexibility
at the decision boundary, but introduces some spurious structure along the
way.

FIGURE 5.11. The simulation example of Figure 2.1. The upper panel ("Additive
Natural Cubic Splines - 4 df each") shows the decision boundary of an additive
logistic regression model, using natural splines in each of the two
coordinates (total df = 1 + (4 − 1) + (4 − 1) = 7). The lower panel ("Natural
Cubic Splines - Tensor Product - 4 df each") shows the results of using a
tensor product of natural spline bases in each coordinate (total df = 4 × 4 =
16). The broken purple boundary is the Bayes decision boundary for this
problem.

One-dimensional smoothing splines (via regularization) generalize to higher
dimensions as well. Suppose we have pairs $y_i, x_i$ with
$x_i \in \mathbb{R}^d$, and we seek a $d$-dimensional regression function
$f(x)$. The idea is to set up the problem

$$\min_f \sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda J[f], \tag{5.37}$$

where $J$ is an appropriate penalty functional for stabilizing a function $f$
in $\mathbb{R}^d$. For example, a natural generalization of the
one-dimensional roughness penalty (5.9) for functions on $\mathbb{R}^2$ is

$$J[f] = \int\!\!\int_{\mathbb{R}^2} \left[ \left(\frac{\partial^2 f(x)}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2 f(x)}{\partial x_1 \partial x_2}\right)^2 + \left(\frac{\partial^2 f(x)}{\partial x_2^2}\right)^2 \right] dx_1\, dx_2. \tag{5.38}$$


Optimizing (5.37) with this penalty leads to a smooth two-dimensional
surface, known as a thin-plate spline. It shares many properties with the
one-dimensional cubic smoothing spline:

• as $\lambda \to 0$, the solution approaches an interpolating function [the
one with smallest penalty (5.38)];

• as $\lambda \to \infty$, the solution approaches the least squares plane;

• for intermediate values of $\lambda$, the solution can be represented as a
linear expansion of basis functions, whose coefficients are obtained
by a form of generalized ridge regression.

The solution has the form

$$f(x) = \beta_0 + \beta^T x + \sum_{j=1}^N \alpha_j h_j(x), \tag{5.39}$$

where $h_j(x) = \|x - x_j\|^2 \log \|x - x_j\|$. These $h_j$ are examples of
radial basis functions, which are discussed in more detail in the next
section. The coefficients are found by plugging (5.39) into (5.37), which
reduces to a finite-dimensional penalized least squares problem. For the
penalty to be finite, the coefficients $\alpha_j$ have to satisfy a set of
linear constraints; see Exercise 5.14.
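To make the form (5.39) concrete, here is a small Python sketch that evaluates the radial basis function $h_j$ and the resulting surface at a point. The coefficients and knot locations are made up purely for illustration: they are not a fitted model and are not chosen to satisfy the linear constraints mentioned above. The function names are hypothetical.

```python
import math

def tps_radial(x, center):
    """h_j(x) = ||x - x_j||^2 log ||x - x_j||, with the limiting value 0 at x = x_j."""
    r2 = sum((a - b) ** 2 for a, b in zip(x, center))
    # r^2 log r equals (1/2) r^2 log r^2, which avoids a square root.
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)

def tps_evaluate(x, beta0, beta, alpha, centers):
    """Evaluate f(x) = beta0 + beta^T x + sum_j alpha_j h_j(x), as in (5.39)."""
    linear = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return linear + sum(a * tps_radial(x, c) for a, c in zip(alpha, centers))

centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical knots
alpha = [1.0, -0.5, -0.5]                         # made-up coefficients
value = tps_evaluate((0.5, 0.5), 0.2, (1.0, -1.0), alpha, centers)
print(value)
```

The explicit handling of zero distance matters in practice, since each training point is also a knot and `log(0)` would otherwise be evaluated.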

Thin-plate splines are defined more generally for arbitrary dimension d, 
for which an appropriately more general J is used. 

There are a number of hybrid approaches that are popular in practice,
both for computational and conceptual simplicity. Unlike one-dimensional
smoothing splines, the computational complexity for thin-plate splines is
$O(N^3)$, since there is not in general any sparse structure that can be
exploited. However, as with univariate smoothing splines, we can get away
with substantially less than the $N$ knots prescribed by the solution (5.39).







FIGURE 5.12. A thin-plate spline fit to the heart disease data, displayed as a
contour plot. The response is systolic blood pressure, modeled as a function
of age and obesity. The data points are indicated, as well as the lattice of points
used as knots. Care should be taken to use knots from the lattice inside the convex
hull of the data (red), and ignore those outside (green).


In practice, it is usually sufficient to work with a lattice of knots covering
the domain. The penalty is computed for the reduced expansion just as
before. Using $K$ knots reduces the computations to $O(NK^2 + K^3)$.
Figure 5.12 shows the result of fitting a thin-plate spline to some heart
disease risk factors, representing the surface as a contour plot. Indicated
are the location of the input features, as well as the knots used in the fit.
Note that $\lambda$ was specified via
$\mathrm{df}_\lambda = \mathrm{trace}(\mathbf{S}_\lambda) = 15$.

More generally one can represent $f \in \mathbb{R}^d$ as an expansion in any
arbitrarily large collection of basis functions, and control the complexity by
applying a regularizer such as (5.38). For example, we could construct a basis
by forming the tensor products of all pairs of univariate smoothing-spline
basis functions as in (5.35), using, for example, the univariate B-splines
recommended in Section 5.9.2 as ingredients. This leads to an exponential
growth in basis functions as the dimension increases, and typically we have
to reduce the number of functions per coordinate accordingly.

The additive spline models discussed in Chapter 9 are a restricted class
of multidimensional splines. They can be represented in this general
formulation as well; that is, there exists a penalty $J[f]$ that guarantees
that the solution has the form $f(X) = \alpha + f_1(X_1) + \cdots + f_d(X_d)$
and that each of the functions $f_j$ is a univariate spline. In this case the
penalty is somewhat degenerate, and it is more natural to assume that $f$ is
additive, and then simply impose an additional penalty on each of the
component functions:

$$J[f] = J(f_1 + f_2 + \cdots + f_d) = \sum_{j=1}^d \int f_j''(t_j)^2\, dt_j. \tag{5.40}$$

These are naturally extended to ANOVA spline decompositions,

$$f(X) = \alpha + \sum_j f_j(X_j) + \sum_{j<k} f_{jk}(X_j, X_k) + \cdots, \tag{5.41}$$

where each of the components is a spline of the required dimension. There
are many choices to be made:

• The maximum order of interaction (we have shown up to order 2 above).

• Which terms to include; not all main effects and interactions are
necessarily needed.

• What representation to use. Some choices are:

  – regression splines with a relatively small number of basis functions
    per coordinate, and their tensor products for interactions;

  – a complete basis as in smoothing splines, and include appropriate
    regularizers for each term in the expansion.

In many cases when the number of potential dimensions (features) is large,
automatic methods are more desirable. The MARS and MART procedures
(Chapters 9 and 10, respectively) both fall into this category.


5.8 Regularization and Reproducing Kernel 
Hilbert Spaces 

In this section we cast splines into the larger context of regularization
methods and reproducing kernel Hilbert spaces. This section is quite technical
and can be skipped by the disinterested or intimidated reader.

A general class of regularization problems has the form

$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda J(f) \right] \tag{5.42}$$

where $L(y, f(x))$ is a loss function, $J(f)$ is a penalty functional, and
$\mathcal{H}$ is a space of functions on which $J(f)$ is defined. Girosi et
al. (1995) describe quite general penalty functionals of the form

$$J(f) = \int_{\mathbb{R}^d} \frac{|\tilde f(s)|^2}{\tilde G(s)}\, ds, \tag{5.43}$$

where $\tilde f$ denotes the Fourier transform of $f$, and $\tilde G$ is some
positive function that falls off to zero as $\|s\| \to \infty$. The idea is
that $1/\tilde G$ increases the penalty for high-frequency components of $f$.
Under some additional assumptions they show that the solutions have the form

$$f(X) = \sum_{k=1}^K \alpha_k \phi_k(X) + \sum_{i=1}^N \theta_i G(X - x_i), \tag{5.44}$$

where the $\phi_k$ span the null space of the penalty functional $J$, and $G$
is the inverse Fourier transform of $\tilde G$. Smoothing splines and
thin-plate splines fall into this framework. The remarkable feature of this
solution is that while the criterion (5.42) is defined over an
infinite-dimensional space, the solution is finite-dimensional. In the next
sections we look at some specific examples.


5.8.1 Spaces of Functions Generated by Kernels 

An important subclass of problems of the form (5.42) are generated by
a positive definite kernel $K(x, y)$, and the corresponding space of
functions $\mathcal{H}_K$ is called a reproducing kernel Hilbert space (RKHS).
The penalty functional $J$ is defined in terms of the kernel as well. We give
a brief and simplified introduction to this class of models, adapted from
Wahba (1990) and Girosi et al. (1995), and nicely summarized in Evgeniou et
al. (2000).

Let $x, y \in \mathbb{R}^p$. We consider the space of functions generated by
the linear span of $\{K(\cdot, y),\ y \in \mathbb{R}^p\}$; i.e., arbitrary
linear combinations of the form $f(x) = \sum_m \alpha_m K(x, y_m)$, where each
kernel term is viewed as a function of the first argument, and indexed by the
second. Suppose that $K$ has an eigen-expansion

$$K(x, y) = \sum_{i=1}^\infty \gamma_i \phi_i(x) \phi_i(y) \tag{5.45}$$

with $\gamma_i \ge 0$, $\sum_{i=1}^\infty \gamma_i^2 < \infty$. Elements of
$\mathcal{H}_K$ have an expansion in terms of these eigen-functions,

$$f(x) = \sum_{i=1}^\infty c_i \phi_i(x), \tag{5.46}$$

with the constraint that

$$\|f\|^2_{\mathcal{H}_K} \stackrel{\mathrm{def}}{=} \sum_{i=1}^\infty c_i^2 / \gamma_i < \infty, \tag{5.47}$$

where $\|f\|_{\mathcal{H}_K}$ is the norm induced by $K$. The penalty
functional in (5.42) for the space $\mathcal{H}_K$ is defined to be the
squared norm $J(f) = \|f\|^2_{\mathcal{H}_K}$. The quantity $J(f)$ can be
interpreted as a generalized ridge penalty, where functions with large
eigenvalues in the expansion (5.45) get penalized less, and vice versa.

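Rewriting (5.42) we have

$$\min_{f \in \mathcal{H}_K} \left[ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}_K} \right] \tag{5.48}$$

or equivalently

$$\min_{\{c_j\}_1^\infty} \left[ \sum_{i=1}^N L\Big(y_i, \sum_{j=1}^\infty c_j \phi_j(x_i)\Big) + \lambda \sum_{j=1}^\infty c_j^2 / \gamma_j \right]. \tag{5.49}$$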


It can be shown (Wahba, 1990, see also Exercise 5.15) that the solution
to (5.48) is finite-dimensional, and has the form

$$f(x) = \sum_{i=1}^N \alpha_i K(x, x_i). \tag{5.50}$$

The basis function $h_i(x) = K(x, x_i)$ (as a function of the first argument)
is known as the representer of evaluation at $x_i$ in $\mathcal{H}_K$, since
for $f \in \mathcal{H}_K$, it is easily seen that
$\langle K(\cdot, x_i), f \rangle_{\mathcal{H}_K} = f(x_i)$. Similarly
$\langle K(\cdot, x_i), K(\cdot, x_j) \rangle_{\mathcal{H}_K} = K(x_i, x_j)$
(the reproducing property of $\mathcal{H}_K$), and hence

$$J(f) = \sum_{i=1}^N \sum_{j=1}^N K(x_i, x_j)\, \alpha_i \alpha_j \tag{5.51}$$

for $f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$.

In light of (5.50) and (5.51), (5.48) reduces to a finite-dimensional
criterion

$$\min_{\alpha} L(\mathbf{y}, \mathbf{K}\alpha) + \lambda \alpha^T \mathbf{K} \alpha. \tag{5.52}$$

We are using a vector notation, in which $\mathbf{K}$ is the $N \times N$
matrix with $ij$th entry $K(x_i, x_j)$ and so on. Simple numerical algorithms
can be used to optimize (5.52). This phenomenon, whereby the
infinite-dimensional problem (5.48) or (5.49) reduces to a finite-dimensional
optimization problem, has been dubbed the kernel property in the literature
on support-vector machines (see Chapter 12).

There is a Bayesian interpretation of this class of models, in which $f$
is interpreted as a realization of a zero-mean stationary Gaussian process,
with prior covariance function $K$. The eigen-decomposition produces a series
of orthogonal eigen-functions $\phi_j(x)$ with associated variances
$\gamma_j$. The typical scenario is that “smooth” functions $\phi_j$ have
large prior variance, while “rough” $\phi_j$ have small prior variances. The
penalty in (5.48) is the contribution of the prior to the joint likelihood,
and penalizes more those components with smaller prior variance (compare with
(5.43)).

For simplicity we have dealt with the case here where all members of
$\mathcal{H}$ are penalized, as in (5.48). More generally, there may be some
components in $\mathcal{H}$ that we wish to leave alone, such as the linear
functions for cubic smoothing splines in Section 5.4. The multidimensional
thin-plate splines of Section 5.7 and tensor product splines fall into this
category as well. In these cases there is a more convenient representation
$\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, with the null space
$\mathcal{H}_0$ consisting of, for example, low degree polynomials in $x$ that
do not get penalized. The penalty becomes $J(f) = \|P_1 f\|$, where $P_1$ is
the orthogonal projection of $f$ onto $\mathcal{H}_1$. The solution has the
form $f(x) = \sum_{j=1}^M \beta_j h_j(x) + \sum_{i=1}^N \alpha_i K(x, x_i)$,
where the first term represents an expansion in $\mathcal{H}_0$. From a
Bayesian perspective, the coefficients of components in $\mathcal{H}_0$ have
improper priors, with infinite variance.


5.8.2 Examples of RKHS 

The machinery above is driven by the choice of the kernel $K$ and the loss
function $L$. We consider first regression using squared-error loss. In this
case (5.48) specializes to penalized least squares, and the solution can be
characterized in two equivalent ways corresponding to (5.49) or (5.52):

$$\min_{\{c_j\}_1^\infty} \sum_{i=1}^N \Big(y_i - \sum_{j=1}^\infty c_j \phi_j(x_i)\Big)^2 + \lambda \sum_{j=1}^\infty \frac{c_j^2}{\gamma_j}, \tag{5.53}$$

an infinite-dimensional, generalized ridge regression problem, or

$$\min_{\alpha} (\mathbf{y} - \mathbf{K}\alpha)^T (\mathbf{y} - \mathbf{K}\alpha) + \lambda \alpha^T \mathbf{K} \alpha. \tag{5.54}$$

The solution for $\alpha$ is obtained simply as

$$\hat\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}, \tag{5.55}$$

and

$$\hat f(x) = \sum_{j=1}^N \hat\alpha_j K(x, x_j). \tag{5.56}$$

The vector of $N$ fitted values is given by

$$\hat{\mathbf{f}} = \mathbf{K}\hat\alpha = \mathbf{K}(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y} \tag{5.57}$$

$$\phantom{\hat{\mathbf{f}}} = (\mathbf{I} + \lambda \mathbf{K}^{-1})^{-1} \mathbf{y}. \tag{5.58}$$

The estimate (5.57) also arises as the kriging estimate of a Gaussian random
field in spatial statistics (Cressie, 1993). Compare also (5.58) with the
smoothing spline fit (5.17) on page 154.
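Equations (5.55) to (5.57) translate almost directly into code. The following Python sketch is a minimal illustration under stated assumptions: it uses a one-dimensional Gaussian kernel with scale 1, a tiny made-up data set, and a hand-written Gaussian elimination routine in place of a linear algebra library. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, nu=1.0):
    return math.exp(-nu * (x - y) ** 2)

def solve(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        a[r] = (M[r][n] - sum(M[r][k] * a[k] for k in range(r + 1, n))) / M[r][r]
    return a

def kernel_ridge_fit(x, y, lam, kernel=gaussian_kernel):
    """alpha = (K + lam I)^{-1} y, as in (5.55). Returns alpha and K."""
    n = len(x)
    K = [[kernel(xi, xj) for xj in x] for xi in x]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, y), K

def predict(x_new, x, alpha, kernel=gaussian_kernel):
    """f(x) = sum_j alpha_j K(x, x_j), as in (5.56)."""
    return sum(a * kernel(x_new, xj) for a, xj in zip(alpha, x))

x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.1, 0.6, 0.8, 1.1, 0.7, 0.2]
lam = 0.1
alpha, K = kernel_ridge_fit(x, y, lam)
fitted = [predict(xi, x, alpha) for xi in x]   # the vector K alpha of (5.57)
print(alpha)
print(fitted)
```

Since $(\mathbf{K} + \lambda\mathbf{I})\hat\alpha = \mathbf{y}$, the fitted values satisfy $\hat{\mathbf{f}} = \mathbf{y} - \lambda\hat\alpha$, which gives a cheap consistency check on any implementation.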

Penalized Polynomial Regression 

The kernel $K(x, y) = (\langle x, y \rangle + 1)^d$ (Vapnik, 1996), for
$x, y \in \mathbb{R}^p$, has $M = \binom{p+d}{d}$ eigen-functions that span
the space of polynomials in $\mathbb{R}^p$ of total degree $d$. For example,
with $p = 2$ and $d = 2$, $M = 6$ and

$$K(x, y) = 1 + 2x_1 y_1 + 2x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 x_2 y_1 y_2 \tag{5.59}$$

$$\phantom{K(x, y)} = \sum_{m=1}^M h_m(x) h_m(y) \tag{5.60}$$

with

$$h(x)^T = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2). \tag{5.61}$$

One can represent $h$ in terms of the $M$ orthogonal eigen-functions and
eigenvalues of $K$,

$$h(x) = \mathbf{V} \mathbf{D}_\gamma^{1/2} \phi(x), \tag{5.62}$$

where $\mathbf{D}_\gamma = \mathrm{diag}(\gamma_1, \gamma_2, \ldots, \gamma_M)$,
and $\mathbf{V}$ is $M \times M$ and orthogonal.
Suppose we wish to solve the penalized polynomial regression problem

$$\min_{\{\beta_m\}_1^M} \sum_{i=1}^N \Big(y_i - \sum_{m=1}^M \beta_m h_m(x_i)\Big)^2 + \lambda \sum_{m=1}^M \beta_m^2. \tag{5.63}$$

Substituting (5.62) into (5.63), we get an expression of the form (5.53) to
optimize (Exercise 5.16).

The number of basis functions $M = \binom{p+d}{d}$ can be very large, often
much larger than $N$. Equation (5.55) tells us that if we use the kernel
representation for the solution function, we have only to evaluate the kernel
$N^2$ times, and can compute the solution in $O(N^3)$ operations.

This simplicity is not without implications. Each of the polynomials $h_m$
in (5.61) inherits a scaling factor from the particular form of $K$, which has
a bearing on the impact of the penalty in (5.63). We elaborate on this in
the next section.

FIGURE 5.13. ("Radial Kernel in $\mathbb{R}^1$".) Radial kernels $k_k(x)$ for
the mixture data, with scale parameter $\nu = 1$. The kernels are centered at
five points $x_m$ chosen at random from the 200.
Gaussian Radial Basis Functions 

In the preceding example, the kernel is chosen because it represents an
expansion of polynomials and can conveniently compute high-dimensional
inner products. In this example the kernel is chosen because of its functional
form in the representation (5.50).

The Gaussian kernel $K(x, y) = e^{-\nu \|x - y\|^2}$ along with squared-error
loss, for example, leads to a regression model that is an expansion in
Gaussian radial basis functions,

$$k_m(x) = e^{-\nu \|x - x_m\|^2}, \quad m = 1, \ldots, N, \tag{5.64}$$

each one centered at one of the training feature vectors $x_m$. The
coefficients are estimated using (5.54).

Figure 5.13 illustrates radial kernels in $\mathbb{R}^1$ using the first
coordinate of the mixture example from Chapter 2. We show five of the 200
kernel basis functions $k_m(x) = K(x, x_m)$.

Figure 5.14 illustrates the implicit feature space for the radial kernel
with $x \in \mathbb{R}^1$. We computed the $200 \times 200$ kernel matrix
$\mathbf{K}$, and its eigen-decomposition
$\mathbf{\Phi} \mathbf{D}_\gamma \mathbf{\Phi}^T$. We can think of the columns
of $\mathbf{\Phi}$ and the corresponding eigenvalues in $\mathbf{D}_\gamma$ as
empirical estimates of the eigen expansion (5.45).² Although the eigenvectors
are discrete, we can represent them as functions on $\mathbb{R}^1$
(Exercise 5.17). Figure 5.15 shows the largest 50 eigenvalues of $\mathbf{K}$.
The leading eigenfunctions are smooth, and they are successively more wiggly
as the order increases. This brings to life the penalty in (5.49), where we
see the coefficients of higher-order functions get penalized more than
lower-order ones. The right panel in Figure 5.14 shows the corresponding
feature space representation of the eigenfunctions

$$h_\ell(x) = \sqrt{\hat\gamma_\ell}\, \hat\phi_\ell(x), \quad \ell = 1, \ldots, N. \tag{5.65}$$

Note that $\langle h(x_i), h(x_{i'}) \rangle = K(x_i, x_{i'})$. The scaling by
the eigenvalues quickly shrinks most of the functions down to zero, leaving an
effective dimension of about 12 in this case. The corresponding optimization
problem is a standard ridge regression, as in (5.63). So although in principle
the implicit feature space is infinite dimensional, the effective dimension is
dramatically lower because of the relative amounts of shrinkage applied to
each basis function. The kernel scale parameter $\nu$ plays a role here as
well; larger $\nu$ implies more local $k_m$ functions, and increases the
effective dimension of the feature space. See Hastie and Zhu (2006) for more
details.

² The $\ell$th column of $\mathbf{\Phi}$ is an estimate of $\phi_\ell$,
evaluated at each of the $N$ observations. Alternatively, the $i$th row of
$\mathbf{\Phi}$ is the estimated vector of basis functions $\phi(x_i)$,
evaluated at the point $x_i$. Although in principle, there can be infinitely
many elements in $\phi$, our estimate has at most $N$ elements.

FIGURE 5.14. (Left panel) The first 16 normalized eigenvectors of
$\mathbf{K}$, the $200 \times 200$ kernel matrix for the first coordinate of
the mixture data. These are viewed as estimates $\hat\phi_\ell$ of the
eigenfunctions in (5.45), and are represented as functions in $\mathbb{R}^1$
with the observed values superimposed in color. They are arranged in rows,
starting at the top left. (Right panel) Rescaled versions
$h_\ell = \sqrt{\hat\gamma_\ell}\, \hat\phi_\ell$ of the functions in the left
panel, for which the kernel computes the “inner product.”

FIGURE 5.15. The largest 50 eigenvalues of $\mathbf{K}$; all those beyond the
30th are effectively zero.

It is also known (Girosi et al., 1995) that a thin-plate spline (Section 5.7)
is an expansion in radial basis functions, generated by the kernel

$$K(x, y) = \|x - y\|^2 \log(\|x - y\|). \tag{5.66}$$

Radial basis functions are discussed in more detail in Section 6.7.

Support Vector Classifiers 

The support vector machines of Chapter 12 for a two-class classification
problem have the form $f(x) = \alpha_0 + \sum_{i=1}^N \alpha_i K(x, x_i)$,
where the parameters are chosen to minimize

$$\min_{\alpha_0, \alpha} \left\{ \sum_{i=1}^N [1 - y_i f(x_i)]_+ + \frac{\lambda}{2} \alpha^T \mathbf{K} \alpha \right\}, \tag{5.67}$$

where $y_i \in \{-1, 1\}$, and $[z]_+$ denotes the positive part of $z$. This
can be viewed as a quadratic optimization problem with linear constraints, and
requires a quadratic programming algorithm for its solution. The name
support vector arises from the fact that typically many of the
$\hat\alpha_i = 0$ [due to the piecewise-zero nature of the loss function in
(5.67)], and so $\hat f$ is an expansion in a subset of the $K(\cdot, x_i)$.
See Section 12.3.3 for more details.
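The criterion (5.67) is simple to evaluate even though minimizing it needs a quadratic programming solver. The Python sketch below only evaluates the objective for given parameters; it does not solve the optimization. The linear kernel, the three data points and the parameter values are made up for illustration, and the function names are hypothetical.

```python
def hinge(z):
    """Positive part [z]_+."""
    return z if z > 0.0 else 0.0

def svm_objective(alpha0, alpha, K, y, lam):
    """Sum of hinge losses plus (lam / 2) * alpha^T K alpha, as in (5.67)."""
    n = len(y)
    f = [alpha0 + sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    loss = sum(hinge(1.0 - yi * fi) for yi, fi in zip(y, f))
    penalty = 0.5 * lam * sum(
        alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n)
    )
    return loss + penalty

xs = [-1.0, 0.5, 2.0]
K = [[a * b for b in xs] for a in xs]      # linear kernel K(x, x') = x * x'
y = [-1, 1, 1]
alpha = [-1.0, 0.0, 0.0]                   # only the first point contributes
objective = svm_objective(0.5, alpha, K, y, lam=2.0)
print(objective)
```

Points with $y_i f(x_i) \ge 1$ contribute nothing to the loss term, which is the piecewise-zero behavior that leads to many zero coefficients at the optimum.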


5.9 Wavelet Smoothing 

We have seen two different modes of operation with dictionaries of basis
functions. With regression splines, we select a subset of the bases, using
either subject-matter knowledge, or else automatically. The more adaptive
procedures such as MARS (Chapter 9) can capture both smooth and nonsmooth
behavior. With smoothing splines, we use a complete basis, but
then shrink the coefficients toward smoothness.


FIGURE 5.16. Some selected wavelets $\psi_{j,k}$ at different translations and
dilations for the Haar and symmlet families (left panel: Haar wavelets; right
panel: symmlet-8 wavelets). The functions have been scaled to suit the
display.




Wavelets typically use a complete orthonormal basis to represent functions,
but then shrink and select the coefficients toward a sparse representation.
Just as a smooth function can be represented by a few spline basis
functions, a mostly flat function with a few isolated bumps can be
represented with a few (bumpy) basis functions. Wavelet bases are very popular
in signal processing and compression, since they are able to represent both
smooth and/or locally bumpy functions in an efficient way, a phenomenon
dubbed time and frequency localization. In contrast, the traditional Fourier
basis allows only frequency localization.

Before we give details, let’s look at the Haar wavelets in the left panel
of Figure 5.16 to get an intuitive idea of how wavelet smoothing works.
The vertical axis indicates the scale (frequency) of the wavelets, from low
scale at the bottom to high scale at the top. At each scale the wavelets are
“packed in” side-by-side to completely fill the time axis: we have only shown
a selected subset. Wavelet smoothing fits the coefficients for this basis by
least squares, and then thresholds (discards, filters) the smaller coefficients.
Since there are many basis functions at each scale, it can use bases where
it needs them and discard the ones it does not need, to achieve time and
frequency localization. The Haar wavelets are simple to understand, but not
smooth enough for most purposes. The symmlet wavelets in the right panel
of Figure 5.16 have the same orthonormal properties, but are smoother.

Figure 5.17 displays an NMR (nuclear magnetic resonance) signal, which 
appears to be composed of smooth components and isolated spikes, plus 
some noise. The wavelet transform, using a symmlet basis, is shown in the 
lower left panel. The wavelet coefficients are arranged in rows, from lowest 
scale at the bottom, to highest scale at the top. The length of each line 
segment indicates the size of the coefficient. The bottom right panel shows 
the wavelet coefficients after they have been thresholded. The threshold 
procedure, given below in equation (5.69), is the same soft-thresholding 
rule that arises in the lasso procedure for linear regression (Section 3.4.2). 
Notice that many of the smaller coefficients have been set to zero. The 
green curve in the top panel shows the back-transform of the thresholded 
coefficients: this is the smoothed version of the original signal. In the next 
section we give the details of this process, including the construction of 
wavelets and the thresholding rule. 
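The transform, threshold and back-transform pipeline described above can be sketched with the Haar basis in a few lines of Python. This is a minimal illustration under stated assumptions: it uses Haar rather than symmlet wavelets, a fixed threshold `lam` rather than an adaptively chosen one, leaves the coarsest (scaling) coefficient unshrunk, and takes the soft-thresholding rule to be $\mathrm{sign}(c)(|c| - \lambda)_+$. The signal values and function names are made up.

```python
import math

def haar_forward(signal):
    """Orthonormal Haar transform of a length-2^J list (pyramid of sums and differences)."""
    approx = list(signal)
    details = []                       # finest scale first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx[0], details

def haar_inverse(coarse, details):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    approx = [coarse]
    for det in reversed(details):      # coarsest scale first
        nxt = []
        for s, d in zip(approx, det):
            nxt.extend([(s + d) / math.sqrt(2.0), (s - d) / math.sqrt(2.0)])
        approx = nxt
    return approx

def soft_threshold(c, lam):
    """sign(c) * (|c| - lam)_+"""
    return math.copysign(max(abs(c) - lam, 0.0), c)

def haar_smooth(signal, lam):
    """Transform, soft-threshold the detail coefficients, and transform back."""
    coarse, details = haar_forward(signal)
    shrunk = [[soft_threshold(c, lam) for c in level] for level in details]
    return haar_inverse(coarse, shrunk)

signal = [2.1, 1.9, 2.0, 2.2, 6.0, 5.8, 2.1, 1.9]   # flat with one isolated bump
smoothed = haar_smooth(signal, lam=0.3)
print(smoothed)
```

With `lam = 0` the signal is reproduced exactly, and with a very large `lam` every detail coefficient is zeroed so the output collapses to the overall mean; intermediate values keep the bump while removing small wiggles.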


5.9.1 Wavelet Bases and the Wavelet Transform 

In this section we give details on the construction and filtering of wavelets. 
Wavelet bases are generated by translations and dilations of a single scal¬ 
ing function (f>{x ) (also known as the father). The red curves in Figure 5.18 
are the Haar and symmlet-8 scaling functions. The Haar basis is particu¬ 
larly easy to understand, especially for anyone with experience in analysis 
of variance or trees, since it produces a piecewise-constant representation. 
Thus if (j)(x ) = I(x S [0,1]), then <j>o,k(x) = <t>{x — k), k an integer, generates 
an orthonormal basis for functions with jumps at the integers. Call this ref¬ 
erence space Vo- The dilations (f>i tk (x) = y/2(f>(2x — k) form an orthonormal 
basis for a space V\ D Vo of functions piecewise constant on intervals of 
length \. In fact, more generally we have ■■■ D Vi D Vo D V_i D • • • where 
each Vj is spanned by fj yk = Vt^cflVx — k). 

Now to the definition of wavelets. In analysis of variance, we often rep¬ 
resent a pair of means ni and V 2 by their grand mean fi = +M 2 ), an d 

then a contrast a = |(a*i — ^ 2 )- A simplification occurs if the contrast a is 
very small, because then we can set it to zero. In a similar manner we might 
represent a function in Vj+i by a component in Vj plus the component in 
the orthogonal complement Wj of Vj to Vj+i, written as Vj + 1 = Vj ® Wj. 
The component in Wj represents detail , and we might wish to set some ele¬ 
ments of this component to zero. It is easy to see that the functions if{x—k) 

FIGURE 5.17. The top panel shows an NMR signal, with the wavelet-shrunk
version superimposed in green. The lower left panel represents the wavelet
transform of the original signal, down to $V_4$, using the symmlet-8 basis.
Each coefficient is represented by the height (positive or negative) of the
vertical bar. The lower right panel represents the wavelet coefficients
after being shrunken using the waveshrink function in S-PLUS, which
implements the SureShrink method of wavelet adaptation of Donoho and
Johnstone. [Panel title: NMR Signal]

FIGURE 5.18. The Haar and symmlet father (scaling) wavelet $\phi(x)$ and
mother wavelet $\psi(x)$. [Panel title: Haar Basis]


generated by the mother wavelet $\psi(x) = \phi(2x) - \phi(2x - 1)$ form an
orthonormal basis for $W_0$ for the Haar family. Likewise
$\psi_{j,k} = 2^{j/2}\psi(2^j x - k)$ form a basis for $W_j$.

Now $V_{j+1} = V_j \oplus W_j = V_{j-1} \oplus W_{j-1} \oplus W_j$, so
besides representing a function by its level-$j$ detail and level-$j$ rough
components, the latter can be broken down to level-$(j-1)$ detail and
rough, and so on. Finally we get a representation of the form
$V_J = V_0 \oplus W_0 \oplus W_1 \cdots \oplus W_{J-1}$. Figure 5.16 on
page 175 shows particular wavelets $\psi_{j,k}(x)$.

Notice that since these spaces are orthogonal, all the basis functions are
orthonormal. In fact, if the domain is discrete with $N = 2^J$ (time)
points, this is as far as we can go. There are $2^j$ basis elements at
level $j$, and adding up, we have a total of $2^J - 1$ elements in the
$W_j$, and one in $V_0$. This structured orthonormal basis allows for a
multiresolution analysis, which we illustrate in the next section.

While helpful for understanding the construction above, the Haar basis 
is often too coarse for practical purposes. Fortunately, many clever wavelet 
bases have been invented. Figures 5.16 and 5.18 include the Daubechies 
symmlet-8 basis. This basis has smoother elements than the corresponding 
Haar basis, but there is a tradeoff: 

• Each wavelet has a support covering 15 consecutive time intervals,
rather than one for the Haar basis. More generally, the symmlet-$p$
family has a support of $2p - 1$ consecutive intervals. The wider the
support, the more time the wavelet has to die to zero, and so it can
achieve this more smoothly. Note that the effective support seems to
be much narrower.

• The symmlet-$p$ wavelet $\psi(x)$ has $p$ vanishing moments; that is,

$$\int \psi(x)\,x^j\,dx = 0, \quad j = 0, \ldots, p - 1.$$

One implication is that any order-$p$ polynomial over the $N = 2^J$ time
points is reproduced exactly in $V_0$ (Exercise 5.18). In this sense $V_0$
is equivalent to the null space of the smoothing-spline penalty. The
Haar wavelets have one vanishing moment, and $V_0$ can reproduce any
constant function.

The symmlet-p scaling functions are one of many families of wavelet 
generators. The operations are similar to those for the Haar basis: 

• If $V_0$ is spanned by $\phi(x - k)$, then $V_1 \supset V_0$ is spanned by
$\phi_{1,k}(x) = \sqrt{2}\,\phi(2x - k)$ and
$\phi(x) = \sum_{k \in \mathcal{Z}} h(k)\phi_{1,k}(x)$, for some filter
coefficients $h(k)$.

• $W_0$ is spanned by $\psi(x) = \sum_{k \in \mathcal{Z}} g(k)\phi_{1,k}(x)$,
with filter coefficients $g(k) = (-1)^{1-k}h(1 - k)$.


5.9.2 Adaptive Wavelet Filtering 

Wavelets are particularly useful when the data are measured on a uniform
lattice, such as a discretized signal, image, or a time series. We will
focus on the one-dimensional case, and having $N = 2^J$ lattice-points is
convenient. Suppose $\mathbf{y}$ is the response vector, and $\mathbf{W}$ is
the $N \times N$ orthonormal wavelet basis matrix evaluated at the $N$
uniformly spaced observations. Then $\mathbf{y}^* = \mathbf{W}^T\mathbf{y}$
is called the wavelet transform of $\mathbf{y}$ (and is the full least
squares regression coefficient). A popular method for adaptive wavelet
fitting is known as SURE shrinkage (Stein Unbiased Risk Estimation, Donoho
and Johnstone (1994)). We start with the criterion

$$\min_{\boldsymbol\theta}\ \|\mathbf{y} - \mathbf{W}\boldsymbol\theta\|_2^2 + 2\lambda\|\boldsymbol\theta\|_1, \tag{5.68}$$

which is the same as the lasso criterion in Chapter 3. Because $\mathbf{W}$
is orthonormal, this leads to the simple solution:

$$\hat\theta_j = \mathrm{sign}(y_j^*)\,(|y_j^*| - \lambda)_+. \tag{5.69}$$

The least squares coefficients are translated toward zero, and truncated
at zero. The fitted function (vector) is then given by the inverse wavelet
transform $\hat{\mathbf{f}} = \mathbf{W}\hat{\boldsymbol\theta}$.
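
The soft-thresholding rule (5.69) acts on each wavelet coefficient
separately, so it can be sketched in a few lines. The Python fragment below
is only an illustration, not the S-PLUS waveshrink implementation; the
function name soft_threshold and the sample coefficients are assumptions
made for this example.

```python
def soft_threshold(y_star, lam):
    """Apply (5.69) coordinatewise: sign(y) * (|y| - lam)_+ ."""
    theta = []
    for y in y_star:
        if abs(y) <= lam:
            theta.append(0.0)        # small coefficients are set to zero
        elif y > 0:
            theta.append(y - lam)    # large positive ones move down by lam
        else:
            theta.append(y + lam)    # large negative ones move up by lam
    return theta


# Hypothetical wavelet coefficients y* and threshold lambda = 1.
coeffs = [3.0, -0.5, 1.2, -2.0, 0.1]
theta_hat = soft_threshold(coeffs, lam=1.0)
print(theta_hat)
```

Coefficients smaller than the threshold in absolute value are dropped
(selection), and the remaining ones are pulled toward zero by the same
amount (shrinkage).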

A simple choice for $\lambda$ is $\lambda = \sigma\sqrt{2\log N}$, where
$\sigma$ is an estimate of the standard deviation of the noise. We can give
some motivation for this choice. Since $\mathbf{W}$ is an orthonormal
transformation, if the elements of $\mathbf{y}$ are white noise
(independent Gaussian variates with mean 0 and variance $\sigma^2$), then
so are $\mathbf{y}^*$. Furthermore if random variables
$Z_1, Z_2, \ldots, Z_N$ are white noise, the expected maximum of $|Z_j|$,
$j = 1, \ldots, N$ is approximately $\sigma\sqrt{2\log N}$. Hence all
coefficients below $\sigma\sqrt{2\log N}$ are likely to be noise and are set
to zero.
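
The motivation for this threshold can be checked informally by simulation.
The sketch below computes $\sigma\sqrt{2\log N}$ and compares it with the
largest absolute value in one simulated white-noise sample; the function
names, the seed, and the choice $N = 1024$ are assumptions made for
illustration, and a single draw is of course only suggestive.

```python
import math
import random


def universal_threshold(sigma, n):
    """lambda = sigma * sqrt(2 log n) for n coefficients with noise sd sigma."""
    return sigma * math.sqrt(2.0 * math.log(n))


def max_abs_noise(sigma, n, rng):
    """Largest |Z_j| among n independent N(0, sigma^2) draws."""
    return max(abs(rng.gauss(0.0, sigma)) for _ in range(n))


rng = random.Random(0)      # fixed seed, chosen arbitrarily
n, sigma = 1024, 1.0
lam = universal_threshold(sigma, n)
largest = max_abs_noise(sigma, n, rng)
print(f"threshold = {lam:.3f}, largest pure-noise coefficient = {largest:.3f}")
```

The threshold grows only like $\sqrt{\log N}$, so even for long signals it
stays within a few noise standard deviations.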

The space W could be any basis of orthonormal functions: polynomials, 
natural splines or cosinusoids. What makes wavelets special is the particular 
form of basis functions used, which allows for a representation localized in 
time and in frequency. 

Let's look again at the NMR signal of Figure 5.17. The wavelet transform
was computed using a symmlet-8 basis. Notice that the coefficients do not
descend all the way to $V_0$, but stop at $V_4$ which has 16 basis functions.
As we ascend to each level of detail, the coefficients get smaller, except in
locations where spiky behavior is present. The wavelet coefficients represent
characteristics of the signal localized in time (the basis functions at each
level are translations of each other) and localized in frequency. Each dilation
increases the detail by a factor of two, and in this sense corresponds to
doubling the frequency in a traditional Fourier representation. In fact, a
more mathematical understanding of wavelets reveals that the wavelets at
a particular scale have a Fourier transform that is restricted to a limited
range or octave of frequencies.

The shrinking/truncation in the right panel was achieved using the SURE
approach described in the introduction to this section. The orthonormal
$N \times N$ basis matrix $\mathbf{W}$ has columns which are the wavelet
basis functions evaluated at the $N$ time points. In particular, in this
case there will be 16 columns corresponding to the $\phi_{4,k}(x)$, and the
remainder devoted to the $\psi_{j,k}(x)$, $j = 4, \ldots, 11$. In practice
$\lambda$ depends on the noise variance, and has to be estimated from the
data (such as the variance of the coefficients at the highest level).

Notice the similarity between the SURE criterion (5.68) on page 179,
and the smoothing spline criterion (5.21) on page 156:

• Both are hierarchically structured from coarse to fine detail, although
wavelets are also localized in time within each resolution level.

• The splines build in a bias toward smooth functions by imposing
differential shrinking constants $d_k$. Early versions of SURE shrinkage
treated all scales equally. The S+wavelets function waveshrink() has
many options, some of which allow for differential shrinkage.

• The spline $L_2$ penalty causes pure shrinkage, while the SURE $L_1$
penalty does shrinkage and selection.

More generally smoothing splines achieve compression of the original signal
by imposing smoothness, while wavelets impose sparsity. Figure 5.19
compares a wavelet fit (using SURE shrinkage) to a smoothing spline fit
(using cross-validation) on two examples different in nature. For the NMR
data in the upper panel, the smoothing spline introduces detail everywhere
in order to capture the detail in the isolated spikes; the wavelet fit
nicely localizes the spikes. In the lower panel, the true function is
smooth, and the noise is relatively high. The wavelet fit has let in some
additional and unnecessary wiggles, a price it pays in variance for the
additional adaptivity.

The wavelet transform is not performed by matrix multiplication as in
$\mathbf{y}^* = \mathbf{W}^T\mathbf{y}$. In fact, using clever pyramidal
schemes $\mathbf{y}^*$ can be obtained in $O(N)$ computations, which is even
faster than the $N\log(N)$ of the fast Fourier transform (FFT). While the
general construction is beyond the scope of this book, it is easy to see
for the Haar basis (Exercise 5.19). Likewise, the inverse wavelet transform
$\mathbf{W}\hat{\boldsymbol\theta}$ is also $O(N)$.
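
For the Haar basis the pyramid scheme can be written out directly. The
sketch below repeatedly replaces adjacent pairs by their scaled sum and
difference, so the work at successive levels is $N/2 + N/4 + \cdots + 1$,
which is $O(N)$. The function names and the test signal are assumptions made
for illustration; this is not the general algorithm for smoother wavelet
families.

```python
import math


def haar_transform(y):
    """Orthonormal Haar transform of a length-2^J signal via the pyramid scheme.

    Returns the single V_0 coefficient and a list of detail-coefficient
    lists, ordered from the finest level to the coarsest.
    """
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    s = math.sqrt(2.0)
    smooth = list(y)
    details = []
    while len(smooth) > 1:
        pairs = range(0, len(smooth), 2)
        # psi = phi(2x) - phi(2x - 1): first half minus second half.
        detail = [(smooth[i] - smooth[i + 1]) / s for i in pairs]
        smooth = [(smooth[i] + smooth[i + 1]) / s for i in pairs]
        details.append(detail)
    return smooth[0], details


def inverse_haar(scaling, details):
    """Invert haar_transform, rebuilding the signal from coarse to fine."""
    s = math.sqrt(2.0)
    smooth = [scaling]
    for detail in reversed(details):
        finer = []
        for a, d in zip(smooth, detail):
            finer.append((a + d) / s)
            finer.append((a - d) / s)
        smooth = finer
    return smooth


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # hypothetical data, N = 8
v0, detail_levels = haar_transform(signal)
print([len(level) for level in detail_levels])          # coefficients per level
print(inverse_haar(v0, detail_levels))
```

Because each step is an orthonormal rotation of a pair of values, the sum
of squares of the coefficients equals that of the original signal.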

This has been a very brief glimpse of this vast and growing field. There is
a very large mathematical and computational base built on wavelets. Modern
image compression is often performed using two-dimensional wavelet
representations.


Bibliographic Notes 

Splines and B-splines are discussed in detail in de Boor (1978). Green
and Silverman (1994) and Wahba (1990) give a thorough treatment of
smoothing splines and thin-plate splines; the latter also covers reproducing
kernel Hilbert spaces. See also Girosi et al. (1995) and Evgeniou et al.
(2000) for connections between many nonparametric regression techniques
using RKHS approaches. Modeling functional data, as in Section 5.2.3, is
covered in detail in Ramsay and Silverman (1997).

Daubechies (1992) is a classic and mathematical treatment of wavelets. 
Other useful sources are Chui (1992) and Wickerhauser (1994). Donoho and 
Johnstone (1994) developed the SURE shrinkage and selection technology 
from a statistical estimation framework; see also Vidakovic (1999). Bruce 
and Gao (1996) is a useful applied introduction, which also describes the 
wavelet software in S-PLUS. 


Exercises 


Ex. 5.1 Show that the truncated power basis functions in (5.3) represent a 
basis for a cubic spline with the two knots as indicated. 

FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two
examples. Each panel compares the SURE-shrunk wavelet fit to the
cross-validated smoothing spline fit. [Panel titles: NMR Signal; Smooth
Function (Simulated)]

Ex. 5.2 Suppose that $B_{i,M}(x)$ is an order-$M$ B-spline defined in the
Appendix on page 186 through the sequence (5.77)-(5.78).

(a) Show by induction that $B_{i,M}(x) = 0$ for
$x \notin [\tau_i, \tau_{i+M}]$. This shows, for example, that the support
of cubic B-splines is at most 5 knots.

(b) Show by induction that $B_{i,M}(x) > 0$ for
$x \in (\tau_i, \tau_{i+M})$. The B-splines are positive in the interior of
their support.

(c) Show by induction that $\sum_{i=1}^{K+M} B_{i,M}(x) = 1$
$\forall x \in [\xi_0, \xi_{K+1}]$.

(d) Show that $B_{i,M}$ is a piecewise polynomial of order $M$ (degree
$M - 1$) on $[\xi_0, \xi_{K+1}]$, with breaks only at the knots
$\xi_1, \ldots, \xi_K$.

(e) Show that an order-$M$ B-spline basis function is the density function
of a convolution of $M$ uniform random variables.

Ex. 5.3 Write a program to reproduce Figure 5.3 on page 145. 

Ex. 5.4 Consider the truncated power series representation for cubic splines
with $K$ interior knots. Let

$$f(X) = \sum_{j=0}^{3} \beta_j X^j + \sum_{k=1}^{K} \theta_k (X - \xi_k)_+^3. \tag{5.70}$$

Prove that the natural boundary conditions for natural cubic splines
(Section 5.2.1) imply the following linear constraints on the coefficients:

$$\beta_2 = 0, \quad \sum_{k=1}^{K} \theta_k = 0, \quad \beta_3 = 0, \quad \sum_{k=1}^{K} \xi_k\theta_k = 0. \tag{5.71}$$

Hence derive the basis (5.4) and (5.5).

Ex. 5.5 Write a program to classify the phoneme data using a quadratic
discriminant analysis (Section 4.3). Since there are many correlated
features, you should filter them using a smooth basis of natural cubic
splines (Section 5.2.3). Decide beforehand on a series of five different
choices for the number and position of the knots, and use tenfold
cross-validation to make the final selection. The phoneme data are
available from the book website
www-stat.stanford.edu/ElemStatLearn.

Ex. 5.6 Suppose you wish to fit a periodic function, with a known period T. 
Describe how you could modify the truncated power series basis to achieve 
this goal. 

Ex. 5.7 Derivation of smoothing splines (Green and Silverman, 1994).
Suppose that $N \ge 2$, and that $g$ is the natural cubic spline interpolant
to the pairs $\{x_i, z_i\}_1^N$, with $a < x_1 < \cdots < x_N < b$. This is a
natural spline with a knot at every $x_i$; being an $N$-dimensional space of
functions, we can determine the coefficients such that it interpolates the
sequence $z_i$ exactly. Let $\tilde g$ be any other differentiable function
on $[a, b]$ that interpolates the $N$ pairs.

(a) Let $h(x) = \tilde g(x) - g(x)$. Use integration by parts and the fact
that $g$ is a natural cubic spline to show that

$$\int_a^b g''(x)h''(x)\,dx = -\sum_{j=1}^{N-1} g'''(x_j^+)\{h(x_{j+1}) - h(x_j)\} = 0. \tag{5.72}$$

(b) Hence show that

$$\int_a^b \tilde g''(t)^2\,dt \ge \int_a^b g''(t)^2\,dt,$$

and that equality can only hold if $h$ is identically zero in $[a, b]$.

(c) Consider the penalized least squares problem

$$\min_f \left[\sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda\int_a^b f''(t)^2\,dt\right].$$

Use (b) to argue that the minimizer must be a cubic spline with knots
at each of the $x_i$.


Ex. 5.8 In the appendix to this chapter we show how the smoothing spline
computations could be more efficiently carried out using a $(N + 4)$
dimensional basis of B-splines. Describe a slightly simpler scheme using a
$(N + 2)$ dimensional B-spline basis defined on the $N - 2$ interior knots.

Ex. 5.9 Derive the Reinsch form
$\mathbf{S}_\lambda = (\mathbf{I} + \lambda\mathbf{K})^{-1}$ for the
smoothing spline.

Ex. 5.10 Derive an expression for $\mathrm{Var}(\hat f_\lambda(x_0))$ and
$\mathrm{bias}(\hat f_\lambda(x_0))$. Using the example (5.22), create a
version of Figure 5.9 where the mean and several (pointwise) quantiles of
$\hat f_\lambda(x)$ are shown.

Ex. 5.11 Prove that for a smoothing spline the null space of K is spanned 
by functions linear in X. 


Ex. 5.12 Characterize the solution to the following problem,

$$\min_f \mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} w_i\{y_i - f(x_i)\}^2 + \lambda\int \{f''(t)\}^2\,dt, \tag{5.73}$$

where the $w_i \ge 0$ are observation weights.

Characterize the solution to the smoothing spline problem (5.9) when
the training data have ties in $X$.

Ex. 5.13 You have fitted a smoothing spline $\hat f_\lambda$ to a sample of
$N$ pairs $(x_i, y_i)$. Suppose you augment your original sample with the
pair $x_0, \hat f_\lambda(x_0)$, and refit; describe the result. Use this to
derive the $N$-fold cross-validation formula (5.26).

Ex. 5.14 Derive the constraints on the $\alpha_j$ in the thin-plate spline
expansion (5.39) to guarantee that the penalty $J(f)$ is finite. How else
could one ensure that the penalty was finite?

Ex. 5.15 This exercise derives some of the results quoted in Section 5.8.1.
Suppose $K(x, y)$ satisfying the conditions (5.45) and let
$f(x) \in \mathcal{H}_K$. Show that

(a) $\langle K(\cdot, x_i), f\rangle_{\mathcal{H}_K} = f(x_i)$.

(b) $\langle K(\cdot, x_i), K(\cdot, x_j)\rangle_{\mathcal{H}_K} = K(x_i, x_j)$.

(c) If $g(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$, then

$$J(g) = \sum_{i=1}^{N}\sum_{j=1}^{N} K(x_i, x_j)\alpha_i\alpha_j.$$

Suppose that $\tilde g(x) = g(x) + \rho(x)$, with
$\rho(x) \in \mathcal{H}_K$, and orthogonal in $\mathcal{H}_K$ to each of
$K(x, x_i)$, $i = 1, \ldots, N$. Show that

(d)

$$\sum_{i=1}^{N} L(y_i, \tilde g(x_i)) + \lambda J(\tilde g) \ge \sum_{i=1}^{N} L(y_i, g(x_i)) + \lambda J(g) \tag{5.74}$$

with equality iff $\rho(x) = 0$.

Ex. 5.16 Consider the ridge regression problem (5.53), and assume
$M \ge N$. Assume you have a kernel $K$ that computes the inner product
$K(x, y) = \sum_{m=1}^{M} h_m(x)h_m(y)$.

(a) Derive (5.62) on page 171 in the text. How would you compute the
matrices $\mathbf{V}$ and $\mathbf{D}_\gamma$, given $K$? Hence show that
(5.63) is equivalent to (5.53).

(b) Show that

$$\hat{\mathbf{f}} = \mathbf{H}\hat\beta = \mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}, \tag{5.75}$$

where $\mathbf{H}$ is the $N \times M$ matrix of evaluations $h_m(x_i)$, and
$\mathbf{K} = \mathbf{H}\mathbf{H}^T$ the $N \times N$ matrix of
inner-products $h(x_i)^T h(x_j)$.

(c) Show that

$$\hat f(x) = h(x)^T\hat\beta = \sum_{i=1}^{N} K(x, x_i)\hat\alpha_i \tag{5.76}$$

and $\hat\alpha = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$.

(d) How would you modify your solution if $M < N$?

Ex. 5.17 Show how to convert the discrete eigen-decomposition of K in 
Section 5.8.2 to estimates of the eigenfunctions of K. 

Ex. 5.18 The wavelet function $\psi(x)$ of the symmlet-$p$ wavelet basis has
vanishing moments up to order $p$. Show that this implies that polynomials
of order $p$ are represented exactly in $V_0$, defined on page 176.

Ex. 5.19 Show that the Haar wavelet transform of a signal of length N = 2 J 
can be computed in O(N) computations. 


Appendix: Computations for Splines 

In this Appendix, we describe the B-spline basis for representing
polynomial splines. We also discuss their use in the computations of
smoothing splines.



B-splines 

Before we can get started, we need to augment the knot sequence defined
in Section 5.2. Let $\xi_0 < \xi_1$ and $\xi_K < \xi_{K+1}$ be two boundary
knots, which typically define the domain over which we wish to evaluate our
spline. We now define the augmented knot sequence $\tau$ such that

• $\tau_1 \le \tau_2 \le \cdots \le \tau_M \le \xi_0$;

• $\tau_{j+M} = \xi_j$, $j = 1, \ldots, K$;

• $\xi_{K+1} \le \tau_{K+M+1} \le \tau_{K+M+2} \le \cdots \le \tau_{K+2M}$.

The actual values of these additional knots beyond the boundary are
arbitrary, and it is customary to make them all the same and equal to
$\xi_0$ and $\xi_{K+1}$, respectively.

Denote by $B_{i,m}(x)$ the $i$th B-spline basis function of order $m$ for
the knot-sequence $\tau$, $m \le M$. They are defined recursively in terms
of divided differences as follows:

$$B_{i,1}(x) = \begin{cases} 1 & \text{if } \tau_i \le x < \tau_{i+1} \\ 0 & \text{otherwise} \end{cases} \tag{5.77}$$

for $i = 1, \ldots, K + 2M - 1$. These are also known as Haar basis
functions.

$$B_{i,m}(x) = \frac{x - \tau_i}{\tau_{i+m-1} - \tau_i}B_{i,m-1}(x) + \frac{\tau_{i+m} - x}{\tau_{i+m} - \tau_{i+1}}B_{i+1,m-1}(x) \tag{5.78}$$

for $i = 1, \ldots, K + 2M - m$.

Thus with $M = 4$, $B_{i,4}$, $i = 1, \ldots, K + 4$ are the $K + 4$ cubic
B-spline basis functions for the knot sequence $\xi$. This recursion can be
continued and will generate the B-spline basis for any order spline.
Figure 5.20 shows the sequence of B-splines up to order four with knots at
the points $0.0, 0.1, \ldots, 1.0$. Since we have created some duplicate
knots, some care has to be taken to avoid division by zero. If we adopt the
convention that $B_{i,1} = 0$ if $\tau_i = \tau_{i+1}$, then by induction
$B_{i,m} = 0$ if $\tau_i = \tau_{i+1} = \ldots = \tau_{i+m}$. Note also that
in the construction above, only the subset $B_{i,m}$,
$i = M - m + 1, \ldots, M + K$ are required for the B-spline basis of order
$m < M$ with knots $\xi$.

To fully understand the properties of these functions, and to show that
they do indeed span the space of cubic splines for the knot sequence,
requires additional mathematical machinery, including the properties of
divided differences. Exercise 5.2 explores these issues.
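
The recursion (5.77)-(5.78), together with the zero-division convention
just described, translates almost line by line into code. The sketch below
is a direct and deliberately naive recursive evaluation, meant only to
illustrate the definition; the function name bspline, the three interior
knots, and the evaluation points are assumptions chosen for the example.

```python
def bspline(i, m, x, tau):
    """B_{i,m}(x) from (5.77)-(5.78); i is 1-based, tau is a Python list."""
    def t(j):
        return tau[j - 1]            # tau_j in the 1-based notation of the text

    if m == 1:
        return 1.0 if t(i) <= x < t(i + 1) else 0.0
    value = 0.0
    # A term whose denominator is zero multiplies a B-spline that is
    # identically zero, so it is skipped (the convention for duplicate knots).
    if t(i + m - 1) > t(i):
        value += (x - t(i)) / (t(i + m - 1) - t(i)) * bspline(i, m - 1, x, tau)
    if t(i + m) > t(i + 1):
        value += (t(i + m) - x) / (t(i + m) - t(i + 1)) * bspline(i + 1, m - 1, x, tau)
    return value


M = 4                                    # cubic splines
interior = [0.25, 0.5, 0.75]             # hypothetical interior knots, K = 3
K = len(interior)
tau = [0.0] * M + interior + [1.0] * M   # boundary knots 0 and 1 repeated M times

for x in (0.0, 0.1, 0.3, 0.5, 0.77, 0.99):
    basis = [bspline(i, M, x, tau) for i in range(1, K + M + 1)]
    print(x, round(sum(basis), 12))
```

At each evaluation point the $K + 4$ basis functions are nonnegative and sum
to one, in line with parts (b) and (c) of Exercise 5.2. Because (5.77) uses
half-open intervals, this sketch returns zero for every basis function at the
right boundary knot itself.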

The scope of B-splines is in fact bigger than advertised here, and has to
do with knot duplication. If we duplicate an interior knot in the
construction of the $\tau$ sequence above, and then generate the B-spline
sequence as before, the resulting basis spans the space of piecewise
polynomials with one less continuous derivative at the duplicated knot. In
general, if in addition to the repeated boundary knots, we include the
interior knot $\xi_j$ $1 \le r_j \le M$ times, then the lowest-order
derivative to be discontinuous at $x = \xi_j$ will be order $M - r_j$. Thus
for cubic splines with no repeats, $r_j = 1$, $j = 1, \ldots, K$, and at
each interior knot the third derivatives $(4 - 1)$ are discontinuous.
Repeating the $j$th knot three times leads to a discontinuous 1st
derivative; repeating it four times leads to a discontinuous zeroth
derivative, i.e., the function is discontinuous at $x = \xi_j$. This is
exactly what happens at the boundary knots; we repeat the knots $M$ times,
so the spline becomes discontinuous at the boundary knots (i.e., undefined
beyond the boundary).

The local support of B-splines has important computational implications,
especially when the number of knots $K$ is large. Least squares
computations with $N$ observations and $K + M$ variables (basis functions)
take $O(N(K + M)^2 + (K + M)^3)$ flops (floating point operations). If $K$
is some appreciable fraction of $N$, this leads to $O(N^3)$ algorithms
which become

FIGURE 5.20. The sequence of B-splines up to order four with ten knots evenly
spaced from 0 to 1. The B-splines have local support; they are nonzero on an
interval spanned by $M + 1$ knots. [Panel titles: B-splines of Order 1;
B-splines of Order 2; B-splines of Order 4. Horizontal axis from 0.0 to 1.0.]

unacceptable for large $N$. If the $N$ observations are sorted, the
$N \times (K + M)$ regression matrix consisting of the $K + M$ B-spline
basis functions evaluated at the $N$ points has many zeros, which can be
exploited to reduce the computational complexity back to $O(N)$. We take
this up further in the next section.


Computations for Smoothing Splines 

Although natural splines (Section 5.2.1) provide a basis for smoothing
splines, it is computationally more convenient to operate in the larger space
of unconstrained B-splines. We write
$f(x) = \sum_{j=1}^{N+4} \gamma_j B_j(x)$, where $\gamma_j$ are coefficients
and the $B_j$ are the cubic B-spline basis functions. The solution looks the
same as before,

$$\hat\gamma = (\mathbf{B}^T\mathbf{B} + \lambda\boldsymbol\Omega_B)^{-1}\mathbf{B}^T\mathbf{y}, \tag{5.79}$$

except now the $N \times N$ matrix $\mathbf{N}$ is replaced by the
$N \times (N + 4)$ matrix $\mathbf{B}$, and similarly the
$(N + 4) \times (N + 4)$ penalty matrix $\boldsymbol\Omega_B$ replaces the
$N \times N$ dimensional $\boldsymbol\Omega_N$. Although at face value it
seems that there are no boundary derivative constraints, it turns out that
the penalty term automatically imposes them by giving effectively infinite
weight to any nonzero derivative beyond the boundary. In practice,
$\hat\gamma$ is restricted to a linear subspace for which the penalty is
always finite.

Since the columns of $\mathbf{B}$ are the evaluated B-splines, in order
from left to right and evaluated at the sorted values of $X$, and the cubic
B-splines have local support, $\mathbf{B}$ is lower 4-banded. Consequently
the matrix $\mathbf{M} = (\mathbf{B}^T\mathbf{B} + \lambda\boldsymbol\Omega)$
is 4-banded and hence its Cholesky decomposition
$\mathbf{M} = \mathbf{L}\mathbf{L}^T$ can be computed easily. One then solves
$\mathbf{L}\mathbf{L}^T\gamma = \mathbf{B}^T\mathbf{y}$ by back-substitution
to give $\gamma$ and hence the solution $\hat f$ in $O(N)$ operations.

In practice, when N is large, it is unnecessary to use all N interior knots, 
and any reasonable thinning strategy will save in computations and have 
negligible effect on the fit. For example, the smooth.spline function in S- 
PLUS uses an approximately logarithmic strategy: if N < 50 all knots are 
included, but even at N = 5,000 only 204 knots are used. 
6

Kernel Smoothing Methods 




In this chapter we describe a class of regression techniques that achieve
flexibility in estimating the regression function $f(X)$ over the domain
$\mathbb{R}^p$ by fitting a different but simple model separately at each
query point $x_0$. This is done by using only those observations close to
the target point $x_0$ to fit the simple model, and in such a way that the
resulting estimated function $\hat f(X)$ is smooth in $\mathbb{R}^p$. This
localization is achieved via a weighting function or kernel
$K_\lambda(x_0, x_i)$, which assigns a weight to $x_i$ based on its distance
from $x_0$. The kernels $K_\lambda$ are typically indexed by a parameter
$\lambda$ that dictates the width of the neighborhood. These memory-based
methods require in principle little or no training; all the work gets done
at evaluation time. The only parameter that needs to be determined from the
training data is $\lambda$. The model, however, is the entire training data
set.

We also discuss more general classes of kernel-based techniques, which
tie in with structured methods in other chapters, and are useful for density
estimation and classification.

The techniques in this chapter should not be confused with those associated
with the more recent usage of the phrase “kernel methods”. In this chapter
kernels are mostly used as a device for localization. We discuss kernel
methods in Sections 5.8, 14.5.4, 18.5 and Chapter 12; in those contexts
the kernel computes an inner product in a high-dimensional (implicit)
feature space, and is used for regularized nonlinear modeling. We make some
connections to the methodology in this chapter at the end of Section 6.7.

FIGURE 6.1. In each panel 100 pairs $x_i, y_i$ are generated at random from
the blue curve with Gaussian errors: $Y = \sin(4X) + \varepsilon$,
$X \sim U[0, 1]$, $\varepsilon \sim N(0, 1/3)$. In the left panel the green
curve is the result of a 30-nearest-neighbor running-mean smoother. The red
point is the fitted constant $\hat f(x_0)$, and the red circles indicate
those observations contributing to the fit at $x_0$. The solid yellow region
indicates the weights assigned to observations. In the right panel, the
green curve is the kernel-weighted average, using an Epanechnikov kernel
with (half) window width $\lambda = 0.2$. [Panel titles: Nearest-Neighbor
Kernel; Epanechnikov Kernel]


6.1 One-Dimensional Kernel Smoothers 

In Chapter 2, we motivated the $k$-nearest-neighbor average

$$\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)) \tag{6.1}$$

as an estimate of the regression function $E(Y \mid X = x)$. Here $N_k(x)$
is the set of $k$ points nearest to $x$ in squared distance, and Ave denotes
the average (mean). The idea is to relax the definition of conditional
expectation, as illustrated in the left panel of Figure 6.1, and compute an
average in a neighborhood of the target point. In this case we have used the
30-nearest neighborhood: the fit at $x_0$ is the average of the 30 pairs
whose $x_i$ values are closest to $x_0$. The green curve is traced out as we
apply this definition at different values $x_0$. The green curve is bumpy,
since $\hat f(x)$ is discontinuous in $x$. As we move $x_0$ from left to
right, the $k$-nearest neighborhood remains constant, until a point $x_i$ to
the right of $x_0$ becomes closer than the furthest point $x_{i'}$ in the
neighborhood to the left of $x_0$, at which time $x_i$ replaces $x_{i'}$.
The average in (6.1) changes in a discrete way, leading to a discontinuous
$\hat f(x)$.
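
The jump described above is easy to reproduce on a toy data set. The
following sketch implements the running mean (6.1) in plain Python; the
function name knn_average and the five data points are assumptions made for
illustration.

```python
def knn_average(x0, xs, ys, k):
    """k-nearest-neighbor average (6.1): mean of the y_i whose x_i are
    the k closest to x0 in squared distance."""
    order = sorted(range(len(xs)), key=lambda i: (xs[i] - x0) ** 2)
    nearest = order[:k]
    return sum(ys[i] for i in nearest) / k


xs = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical inputs
ys = [0.0, 10.0, 20.0, 30.0, 40.0]    # hypothetical responses

# Just left of x0 = 1.5 the 3 nearest points are {0, 1, 2}; just right of
# it the point at 3 replaces the point at 0, and the fit jumps.
print(knn_average(1.49, xs, ys, k=3))
print(knn_average(1.51, xs, ys, k=3))
```

The two printed values differ by a full step even though the target point
moves only slightly, which is the discontinuity seen in the left panel of
Figure 6.1.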

This discontinuity is ugly and unnecessary. Rather than give all the
points in the neighborhood equal weight, we can assign weights that die
off smoothly with distance from the target point. The right panel shows
an example of this, using the so-called Nadaraya-Watson kernel-weighted
average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \tag{6.2}$$

with the Epanechnikov quadratic kernel

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \tag{6.3}$$

with

$$D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases} \tag{6.4}$$

The fitted function is now continuous, and quite smooth in the right panel 
of Figure 6.1. As we move the target from left to right, points enter the 
neighborhood initially with weight zero, and then their contribution slowly 
increases (see Exercise 6.1). 
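
A minimal sketch of the kernel-weighted average (6.2) with the Epanechnikov
kernel (6.3)-(6.4) follows. The function names and the three data points are
assumptions made for illustration; the sketch raises an error when no
observation falls inside the window, a case the formula leaves undefined.

```python
def epanechnikov(t):
    """D(t) in (6.4): 3/4 (1 - t^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def nadaraya_watson(x0, xs, ys, lam):
    """Kernel-weighted average (6.2) with metric window width lam."""
    weights = [epanechnikov(abs(x - x0) / lam) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within the window around x0")
    return sum(w * y for w, y in zip(weights, ys)) / total


xs = [-0.1, 0.0, 0.1]     # hypothetical inputs, symmetric about the target
ys = [1.0, 2.0, 3.0]
print(nadaraya_watson(0.0, xs, ys, lam=0.2))
```

With inputs placed symmetrically around the target, the two outer points
receive equal weight, so the fit at $x_0 = 0$ coincides with the middle
response.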

In the right panel we used a metric window size $\lambda = 0.2$ for the
kernel fit, which does not change as we move the target point $x_0$, while
the size of the 30-nearest-neighbor smoothing window adapts to the local
density of the $x_i$. One can, however, also use such adaptive neighborhoods
with kernels, but we need to use a more general notation. Let
$h_\lambda(x_0)$ be a width function (indexed by $\lambda$) that determines
the width of the neighborhood at $x_0$. Then more generally we have

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right). \tag{6.5}$$

In (6.3), $h_\lambda(x_0) = \lambda$ is constant. For $k$-nearest
neighborhoods, the neighborhood size $k$ replaces $\lambda$, and we have
$h_k(x_0) = |x_0 - x_{[k]}|$ where $x_{[k]}$ is the $k$th closest $x_i$ to
$x_0$.
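
The nearest-neighbor width function $h_k(x_0)$ plugs into (6.5) as in the
sketch below. The names knn_width and adaptive_weights and the sample points
are assumptions made for illustration, and the sketch assumes the $k$th
closest point does not coincide with $x_0$ (otherwise the width is zero).

```python
def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def knn_width(x0, xs, k):
    """h_k(x0) = |x0 - x_[k]|, the distance to the k-th closest x_i."""
    return sorted(abs(x - x0) for x in xs)[k - 1]


def adaptive_weights(x0, xs, k):
    """Kernel weights (6.5) using the nearest-neighbor width h_k(x0)."""
    h = knn_width(x0, xs, k)
    return [epanechnikov(abs(x - x0) / h) for x in xs]


xs = [0.0, 0.1, 0.2, 0.5, 1.0]       # hypothetical, unevenly spaced inputs
x0 = 0.15
print(knn_width(x0, xs, k=3))
print(adaptive_weights(x0, xs, k=3))
```

Note that with a compact kernel the $k$th closest point sits exactly on the
edge of the window and so receives weight zero.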

There are a number of details that one has to attend to in practice: 

• The smoothing parameter $\lambda$, which determines the width of the local
neighborhood, has to be determined. Large $\lambda$ implies lower variance
(averages over more observations) but higher bias (we essentially
assume the true function is constant within the window).

• Metric window widths (constant $h_\lambda(x)$) tend to keep the bias of
the estimate constant, but the variance is inversely proportional to the
local density. Nearest-neighbor window widths exhibit the opposite
behavior; the variance stays constant and the absolute bias varies
inversely with local density.

• Issues arise with nearest-neighbors when there are ties in the $x_i$. With
most smoothing techniques one can simply reduce the data set by
averaging the $y_i$ at tied values of $X$, and supplementing these new
observations at the unique values of $x_i$ with an additional weight $w_i$
(which multiplies the kernel weight).

FIGURE 6.2. A comparison of three popular kernels for local smoothing. Each
has been calibrated to integrate to 1. The tri-cube kernel is compact and has
two continuous derivatives at the boundary of its support, while the
Epanechnikov kernel has none. The Gaussian kernel is continuously
differentiable, but has infinite support.

• This leaves a more general problem to deal with: observation weights
$w_i$. Operationally we simply multiply them by the kernel weights before
computing the weighted average. With nearest neighborhoods, it
is now natural to insist on neighborhoods with a total weight content
$k$ (relative to $\sum w_i$). In the event of overflow (the last observation
needed in a neighborhood has a weight $w_j$ which causes the sum of
weights to exceed the budget $k$), then fractional parts can be used.

• Boundary issues arise. The metric neighborhoods tend to contain fewer
points on the boundaries, while the nearest-neighborhoods get wider.

• The Epanechnikov kernel has compact support (needed when used
with nearest-neighbor window size). Another popular compact kernel
is based on the tri-cube function

$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases} \tag{6.6}$$

This is flatter on the top (like the nearest-neighbor box) and is
differentiable at the boundary of its support. The Gaussian density
function $D(t) = \phi(t)$ is a popular noncompact kernel, with the
standard deviation playing the role of the window size. Figure 6.2 compares
the three.
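
The three kernel shapes can be written down and checked numerically. In the
sketch below the midpoint-rule helper integrate and its step count are
assumptions made for illustration. The Epanechnikov and Gaussian forms
already integrate to 1, while the tri-cube function as written in (6.6)
integrates to $81/70$, so a version calibrated as in Figure 6.2 would carry
the factor $70/81$.

```python
import math


def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def tricube(t):
    """Tri-cube function (6.6), not rescaled to integrate to 1."""
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) <= 1.0 else 0.0


def gaussian(t):
    """Standard Gaussian density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (j + 0.5) * h) for j in range(steps))


print(round(integrate(epanechnikov, -1.0, 1.0), 6))
print(round(integrate(tricube, -1.0, 1.0), 6), round(81.0 / 70.0, 6))
print(round(integrate(gaussian, -8.0, 8.0), 6))
```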

6.1.1 Local Linear Regression 

We have progressed from the raw moving average to a smoothly varying 
locally weighted average by using kernel weighting. The smooth kernel fit 
still has problems, however, as exhibited in Figure 6.3 (left panel). Locally- 
weighted averages can be badly biased on the boundaries of the domain, 

FIGURE 6.3. The locally weighted average has bias problems at or near the
boundaries of the domain. The true function is approximately linear here, but
most of the observations in the neighborhood have a higher mean than the target
point, so despite weighting, their mean will be biased upwards. By fitting a locally
weighted linear regression (right panel), this bias is removed to first order.
[Panel title: Local Linear Regression at Boundary]


because of the asymmetry of the kernel in that region. By fitting straight 
lines rather than constants locally, we can remove this bias exactly to first 
order; see Figure 6.3 (right panel). Actually, this bias can be present in the 
interior of the domain as well, if the X values are not equally spaced (for 
the same reasons, but usually less severe). Again locally weighted linear 
regression will make a first-order correction. 

Locally weighted regression solves a separate weighted least squares
problem at each target point $x_0$:

$$\min_{\alpha(x_0),\,\beta(x_0)}\ \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)x_i\right]^2. \tag{6.7}$$

The estimate is then $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)x_0$.
Notice that although we fit an entire linear model to the data in the
region, we only use it to evaluate the fit at the single point $x_0$.
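
For a single predictor the weighted least squares problem (6.7) has a closed
form through the usual $2 \times 2$ normal equations. The sketch below uses
that form to compare the local linear fit with the locally weighted average
at a boundary point, on noise-free linear data. The function names, the
Epanechnikov kernel choice, the window width 0.35, and the data are
assumptions made for illustration.

```python
def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def local_average(x0, xs, ys, lam):
    """Locally weighted (Nadaraya-Watson) average at x0."""
    w = [epanechnikov(abs(x - x0) / lam) for x in xs]
    return sum(wi * y for wi, y in zip(w, ys)) / sum(w)


def local_linear(x0, xs, ys, lam):
    """Evaluate the local linear fit (6.7) at the single point x0."""
    w = [epanechnikov(abs(x - x0) / lam) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, xs))
    s2 = sum(wi * x * x for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("need at least two distinct points inside the window")
    beta = (s0 * t1 - s1 * t0) / det      # weighted least squares slope
    alpha = (t0 - beta * s1) / s0         # weighted least squares intercept
    return alpha + beta * x0


xs = [i / 10 for i in range(11)]          # hypothetical grid on [0, 1]
ys = [2.0 + 3.0 * x for x in xs]          # exactly linear, no noise
print(local_average(0.0, xs, ys, lam=0.35))   # biased upward at the boundary
print(local_linear(0.0, xs, ys, lam=0.35))    # recovers f(0) = 2
```

At the left boundary every neighbor lies to the right of the target, so the
weighted average overshoots, while the local line reproduces the linear
function up to rounding.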

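Define the vector-valued function $b(x)^T = (1, x)$. Let $\mathbf{B}$ be the
$N \times 2$ regression matrix with $i$th row $b(x_i)^T$, and
$\mathbf{W}(x_0)$ the $N \times N$ diagonal matrix with $i$th diagonal
element $K_\lambda(x_0, x_i)$. Then

$$\hat f(x_0) = b(x_0)^T(\mathbf{B}^T\mathbf{W}(x_0)\mathbf{B})^{-1}\mathbf{B}^T\mathbf{W}(x_0)\mathbf{y} \tag{6.8}$$

$$\phantom{\hat f(x_0)} = \sum_{i=1}^{N} l_i(x_0)\,y_i. \tag{6.9}$$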

Equation (6.8) gives an explicit expression for the local linear regression 
estimate, and (6.9) highlights the fact that the estimate is linear in the 

FIGURE 6.4. The green points show the equivalent kernel $l_i(x_0)$ for local
regression. These are the weights in
$\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)y_i$, plotted against their
corresponding $x_i$. For display purposes, these have been rescaled, since
in fact they sum to 1. Since the yellow shaded region is the (rescaled)
equivalent kernel for the Nadaraya-Watson local average, we see how local
regression automatically modifies the weighting kernel to correct for biases
due to asymmetry in the smoothing window. [Panel titles: Local Linear
Equivalent Kernel at Boundary; Local Linear Equivalent Kernel in Interior]


Hi (the li(x o) do not involve y). These weights U(x o) combine the weight¬ 
ing kernel K\(x o,-) and the least squares operations, and are sometimes 
referred to as the equivalent kernel. Figure 6.4 illustrates the effect of lo¬ 
cal linear regression on the equivalent kernel. Historically, the bias in the 
Nadaraya-Watson and other local average kernel methods were corrected 
by modifying the kernel. These modifications were based on theoretical 
asymptotic mean-square-error considerations, and besides being tedious to 
implement, are only approximate for finite sample sizes. Local linear re¬ 
gression automatically modifies the kernel to correct the bias exactly to 
first order, a phenomenon dubbed as automatic kernel carpentry. Consider 
the following expansion for E/(a;o), using the linearity of local regression 
and a series expansion of the true function / around Xq, 

$$\begin{aligned}
E\hat f(x_0) &= \sum_{i=1}^{N} l_i(x_0) f(x_i) \\
&= f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) \\
&\quad + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R,
\end{aligned} \tag{6.10}$$

where the remainder term $R$ involves third- and higher-order derivatives of
$f$, and is typically small under suitable smoothness assumptions. It can be
shown (Exercise 6.2) that for local linear regression, $\sum_{i=1}^{N} l_i(x_0) = 1$ and
$\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$. Hence the first two terms reduce to $f(x_0)$, and since
the bias is $E\hat f(x_0) - f(x_0)$, we see that it depends only on quadratic and
higher-order terms in the expansion of $f$.

[Figure 6.5 panels: Local Linear in Interior; Local Quadratic in Interior]

FIGURE 6.5. Local linear fits exhibit bias in regions of curvature of the true
function. Local quadratic fits tend to eliminate this bias.
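The two moment identities for the equivalent kernel of local linear regression can be checked numerically. The sketch below is an illustration only: it assumes a tri-cube kernel with metric width `lam`, uses the 2 x 2 closed form of (6.8) with the regressor centred at $x_0$, and the function names are hypothetical.

```python
def tricube(t):
    """Tri-cube profile D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0."""
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) <= 1.0 else 0.0


def equivalent_kernel(x0, xs, lam):
    """Weights l_i(x0) such that the local linear fit is sum_i l_i(x0) * y_i."""
    w = [tricube((x - x0) / lam) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1
    # l_i = w_i * (1, u_i) (B^T W B)^{-1} (1, 0)^T with u_i = x_i - x0.
    return [wi * (s2 - (x - x0) * s1) / det for wi, x in zip(w, xs)]


xs = [i / 20 for i in range(21)]
weights = equivalent_kernel(0.0, xs, 0.3)  # target at the left boundary
print(sum(weights))                                     # zeroth moment
print(sum(l * (x - 0.0) for l, x in zip(weights, xs)))  # first moment
print(min(weights))                                     # some weights are negative
```

At a boundary target the weights furthest from $x_0$ become negative, which is how the local linear fit offsets the one-sided window seen in Figure 6.4.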


6.1.2 Local Polynomial Regression 

Why stop at local linear fits? We can fit local polynomials of any degree $d$,

$$\min_{\alpha(x_0),\,\beta_j(x_0),\ j=1,\ldots,d} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\Bigl[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\Bigr]^2 \tag{6.11}$$

with solution $\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j$. In fact, an expansion such as
(6.10) will tell us that the bias will only have components of degree $d+1$ and
higher (Exercise 6.2). Figure 6.5 illustrates local quadratic regression. Local
linear fits tend to be biased in regions of curvature of the true function, a
phenomenon referred to as trimming the hills and filling the valleys. Local
quadratic regression is generally able to correct this bias.

There is of course a price to be paid for this bias reduction, and that is
increased variance. The fit in the right panel of Figure 6.5 is slightly more
wiggly, especially in the tails. Assuming the model $y_i = f(x_i) + \varepsilon_i$, with
$\varepsilon_i$ independent and identically distributed with mean zero and variance
$\sigma^2$, $\mathrm{Var}(\hat f(x_0)) = \sigma^2 \|l(x_0)\|^2$, where $l(x_0)$ is the vector of equivalent kernel
weights at $x_0$. It can be shown (Exercise 6.3) that $\|l(x_0)\|$ increases with $d$,
and so there is a bias-variance tradeoff in selecting the polynomial degree.
Figure 6.6 illustrates these variance curves for degree zero, one and two
local polynomials.

FIGURE 6.6. The variance functions $\|l(x)\|^2$ for local constant, linear and
quadratic regression, for a metric bandwidth ($\lambda = 0.2$) tri-cube kernel.

To summarize some collected wisdom on this issue:

• Local linear fits can reduce bias dramatically at the boundaries at a
modest cost in variance. Local quadratic fits do little at the boundaries
for bias, but increase the variance a lot.

• Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain.

• Asymptotic analysis suggests that local polynomials of odd degree
dominate those of even degree. This is largely due to the fact that
asymptotically the MSE is dominated by boundary effects.

While it may be helpful to tinker, and move from local linear fits at the 
boundary to local quadratic fits in the interior, we do not recommend such 
strategies. Usually the application will dictate the degree of the fit. For 
example, if we are interested in extrapolation, then the boundary is of 
more interest, and local linear fits are probably more reliable. 
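To experiment with the degree of the local polynomial, one can compute the equivalent kernel for general $d$ and its squared norm, which is the variance factor $\|l(x_0)\|^2$ discussed above. This is a rough sketch under stated assumptions: a tri-cube kernel with metric width, a small hand-written Gaussian elimination in place of a linear algebra library, and illustrative function names.

```python
def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def tricube(t):
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) <= 1.0 else 0.0


def local_poly_weights(x0, xs, lam, degree):
    """Equivalent-kernel weights l_i(x0) for a local polynomial fit of the given degree."""
    w = [tricube((x - x0) / lam) for x in xs]
    u = [x - x0 for x in xs]  # centre at x0, so the fit at x0 is the intercept
    # Gram matrix B^T W B for the basis (1, u, ..., u^degree).
    gram = [[sum(wi * ui ** (j + k) for wi, ui in zip(w, u))
             for k in range(degree + 1)] for j in range(degree + 1)]
    e0 = [1.0] + [0.0] * degree  # b(x0) after centring
    a = solve(gram, e0)
    return [wi * sum(a[j] * ui ** j for j in range(degree + 1))
            for wi, ui in zip(w, u)]


def variance_factor(x0, xs, lam, degree):
    """||l(x0)||^2, so that Var(f_hat(x0)) = sigma^2 * ||l(x0)||^2."""
    return sum(l * l for l in local_poly_weights(x0, xs, lam, degree))


xs = [i / 50 for i in range(51)]
for d in (0, 1, 2):
    print(d, variance_factor(0.0, xs, 0.2, d), variance_factor(0.5, xs, 0.2, d))
```

Comparing the printed factors at a boundary target and an interior target gives a small-scale version of the curves in Figure 6.6.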


6.2 Selecting the Width of the Kernel 

In each of the kernels $K_\lambda$, $\lambda$ is a parameter that controls its width:

• For the Epanechnikov or tri-cube kernel with metric width, $\lambda$ is the
radius of the support region.

• For the Gaussian kernel, $\lambda$ is the standard deviation.

• $\lambda$ is the number $k$ of nearest neighbors in $k$-nearest neighborhoods,
often expressed as a fraction or span $k/N$ of the total training sample.

FIGURE 6.7. Equivalent kernels for a local linear regression smoother (tri-cube
kernel; orange) and a smoothing spline (blue), with matching degrees of freedom.
The vertical spikes indicate the target points.


There is a natural bias-variance tradeoff as we change the width of the 
averaging window, which is most explicit for local averages: 

• If the window is narrow, $\hat f(x_0)$ is an average of a small number of $y_i$
close to $x_0$, and its variance will be relatively large, close to that of
an individual $y_i$. The bias will tend to be small, again because each
of the $E(y_i) = f(x_i)$ should be close to $f(x_0)$.

• If the window is wide, the variance of $\hat f(x_0)$ will be small relative to
the variance of any $y_i$, because of the effects of averaging. The bias
will be higher, because we are now using observations $x_i$ further from
$x_0$, and there is no guarantee that $f(x_i)$ will be close to $f(x_0)$.

Similar arguments apply to local regression estimates, say local linear: as
the width goes to zero, the estimates approach a piecewise-linear function
that interpolates the training data¹; as the width gets infinitely large, the
fit approaches the global linear least-squares fit to the data.

The discussion in Chapter 5 on selecting the regularization parameter for
smoothing splines applies here, and will not be repeated. Local regression
smoothers are linear estimators; the smoother matrix in $\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}$ is built up
from the equivalent kernels (6.8), and has $ij$th entry $\{\mathbf{S}_\lambda\}_{ij} = l_i(x_j)$.
Leave-one-out cross-validation is particularly simple (Exercise 6.7), as is
generalized cross-validation, $C_p$ (Exercise 6.10), and $k$-fold cross-validation. The
effective degrees of freedom is again defined as $\mathrm{trace}(\mathbf{S}_\lambda)$, and can be used
to calibrate the amount of smoothing. Figure 6.7 compares the equivalent
kernels for a smoothing spline and local linear regression. The local regression
smoother has a span of 40%, which results in $\mathrm{df} = \mathrm{trace}(\mathbf{S}_\lambda) = 5.86$.
The smoothing spline was calibrated to have the same df, and their equivalent
kernels are qualitatively quite similar.
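As a rough illustration of these quantities, the sketch below builds the smoother matrix for a local linear smoother, takes its trace as the effective degrees of freedom, and compares leave-one-out cross-validation computed two ways. It assumes a Gaussian kernel with a fixed (metric) bandwidth, and it uses the standard weighted least squares identity that the leave-one-out residual equals the ordinary residual divided by one minus the diagonal smoother entry; that identity is not stated in the text and would not hold as written for nearest-neighbor bandwidths. All names and data are illustrative.

```python
import math


def local_linear_weights(x0, xs, lam):
    """Equivalent kernel of a local linear fit at x0 with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x - x0) / lam) ** 2) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1
    return [wi * (s2 - (x - x0) * s1) / det for wi, x in zip(w, xs)]


def smoother_matrix(xs, lam):
    """Row i holds the equivalent kernel at target x_i, so fitted values are S y."""
    return [local_linear_weights(xi, xs, lam) for xi in xs]


def effective_df(s):
    """Effective degrees of freedom, trace(S)."""
    return sum(s[i][i] for i in range(len(s)))


def loocv_shortcut(xs, ys, lam):
    """Mean squared leave-one-out residual from a single fit."""
    total = 0.0
    for i, row in enumerate(smoother_matrix(xs, lam)):
        fit = sum(l * y for l, y in zip(row, ys))
        total += ((ys[i] - fit) / (1.0 - row[i])) ** 2
    return total / len(xs)


def loocv_brute_force(xs, ys, lam):
    """Mean squared leave-one-out residual by refitting without each point."""
    total = 0.0
    for i in range(len(xs)):
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        l = local_linear_weights(xs[i], xs_i, lam)
        total += (ys[i] - sum(lj * yj for lj, yj in zip(l, ys_i))) ** 2
    return total / len(xs)


xs = [i / 30 for i in range(31)]
ys = [(4 * x - 2) ** 2 + 0.1 * (-1) ** i for i, x in enumerate(xs)]
print(effective_df(smoother_matrix(xs, 0.1)))
print(loocv_shortcut(xs, ys, 0.1), loocv_brute_force(xs, ys, 0.1))
```

Scanning `lam` over a grid and picking the value with the smallest leave-one-out error is one simple way to select the kernel width.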


¹ With uniformly spaced $x_i$; with irregularly spaced $x_i$, the behavior can deteriorate.


6.3 Local Regression in $\mathbb{R}^p$

Kernel smoothing and local regression generalize very naturally to two or
more dimensions. The Nadaraya-Watson kernel smoother fits a constant
locally with weights supplied by a $p$-dimensional kernel. Local linear
regression will fit a hyperplane locally in $X$, by weighted least squares, with
weights supplied by a $p$-dimensional kernel. It is simple to implement and
is generally preferred to the local constant fit for its superior performance
on the boundaries.

Let $b(X)$ be a vector of polynomial terms in $X$ of maximum degree $d$.
For example, with $d = 1$ and $p = 2$ we get $b(X) = (1, X_1, X_2)$; with $d = 2$
we get $b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1 X_2)$; and trivially with $d = 0$ we get
$b(X) = 1$. At each $x_0 \in \mathbb{R}^p$ solve

$$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl(y_i - b(x_i)^T \beta(x_0)\bigr)^2 \tag{6.12}$$

to produce the fit $\hat f(x_0) = b(x_0)^T \hat\beta(x_0)$. Typically the kernel will be a radial
function, such as the radial Epanechnikov or tri-cube kernel

$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right), \tag{6.13}$$

where $\|\cdot\|$ is the Euclidean norm. Since the Euclidean norm depends on the
units in each coordinate, it makes most sense to standardize each predictor,
for example, to unit standard deviation, prior to smoothing.

While boundary effects are a problem in one-dimensional smoothing,
they are a much bigger problem in two or higher dimensions, since the
fraction of points on the boundary is larger. In fact, one of the manifestations
of the curse of dimensionality is that the fraction of points close to the
boundary increases to one as the dimension grows. Directly modifying the
kernel to accommodate two-dimensional boundaries becomes very messy,
especially for irregular boundaries. Local polynomial regression seamlessly
performs boundary correction to the desired order in any dimension.
Figure 6.8 illustrates local linear regression on some measurements from an
astronomical study with an unusual predictor design (star-shaped). Here
the boundary is extremely irregular, and the fitted surface must also
interpolate over regions of increasing data sparsity as we approach the boundary.

FIGURE 6.8. The left panel shows three-dimensional data, where the response
is the velocity measurements on a galaxy, and the two predictors record positions
on the celestial sphere. The unusual “star”-shaped design indicates the way the
measurements were made, and results in an extremely irregular boundary. The
right panel shows the results of local linear regression smoothing in $\mathbb{R}^2$, using a
nearest-neighbor window with 15% of the data.

Local regression becomes less useful in dimensions much higher than two
or three. We have discussed in some detail the problems of dimensionality,
for example, in Chapter 2. It is impossible to simultaneously maintain
localness ($\Rightarrow$ low bias) and a sizable sample in the neighborhood ($\Rightarrow$
low variance) as the dimension increases, without the total sample size
increasing exponentially in $p$. Visualization of $\hat f(X)$ also becomes difficult in
higher dimensions, and this is often one of the primary goals of smoothing.
Although the scatter-cloud and wire-frame pictures in Figure 6.8 look
attractive, it is quite difficult to interpret the results except at a gross level.
From a data analysis perspective, conditional plots are far more useful.

Figure 6.9 shows an analysis of some environmental data with three
predictors. The trellis display here shows ozone as a function of radiation,
conditioned on the other two variables, temperature and wind speed. However,
conditioning on the value of a variable really implies local to that
value (as in local regression). Above each of the panels in Figure 6.9 is an
indication of the range of values present in that panel for each of the
conditioning values. In the panel itself the data subsets are displayed (response
versus remaining variable), and a one-dimensional local linear regression is
fit to the data. Although this is not quite the same as looking at slices of
a fitted three-dimensional surface, it is probably more useful in terms of
understanding the joint behavior of the data.


6.4 Structured Local Regression Models in $\mathbb{R}^p$

When the dimension to sample-size ratio is unfavorable, local regression
does not help us much, unless we are willing to make some structural
assumptions about the model. Much of this book is about structured regression
and classification models. Here we focus on some approaches directly
related to kernel methods.

[Figure 6.9 horizontal axis: Solar Radiation (langleys)]

FIGURE 6.9. Three-dimensional smoothing example. The response is (cube-root
of) ozone concentration, and the three predictors are temperature, wind speed and
radiation. The trellis display shows ozone as a function of radiation, conditioned
on intervals of temperature and wind speed (indicated by darker green or orange
shaded bars). Each panel contains about 40% of the range of each of the
conditioned variables. The curve in each panel is a univariate local linear regression,
fit to the data in the panel.


6.4.1 Structured Kernels

One line of approach is to modify the kernel. The default spherical
kernel (6.13) gives equal weight to each coordinate, and so a natural default
strategy is to standardize each variable to unit standard deviation. A more
general approach is to use a positive semidefinite matrix $\mathbf{A}$ to weigh the
different coordinates:

$$K_{\lambda,A}(x_0, x) = D\!\left(\frac{(x - x_0)^T \mathbf{A}\,(x - x_0)}{\lambda}\right). \tag{6.14}$$

Entire coordinates or directions can be downgraded or omitted by imposing
appropriate restrictions on $\mathbf{A}$. For example, if $\mathbf{A}$ is diagonal, then we can
increase or decrease the influence of individual predictors $X_j$ by increasing
or decreasing $A_{jj}$. Often the predictors are many and highly correlated,
such as those arising from digitized analog signals or images. The covariance
function of the predictors can be used to tailor a metric $\mathbf{A}$ that focuses less,
say, on high-frequency contrasts (Exercise 6.4). Proposals have been made
for learning the parameters for multidimensional kernels. For example, the
projection-pursuit regression model discussed in Chapter 11 is of this flavor,
where low-rank versions of $\mathbf{A}$ imply ridge functions for $\hat f(X)$. More general
models for $\mathbf{A}$ are cumbersome, and we favor instead the structured forms
for the regression function discussed next.

6.4.2 Structured Regression Functions

We are trying to fit a regression function $E(Y|X) = f(X_1, X_2, \ldots, X_p)$ in
$\mathbb{R}^p$, in which every level of interaction is potentially present. It is natural
to consider analysis-of-variance (ANOVA) decompositions of the form

$$f(X_1, X_2, \ldots, X_p) = \alpha + \sum_{j} g_j(X_j) + \sum_{k<\ell} g_{k\ell}(X_k, X_\ell) + \cdots \tag{6.15}$$

and then introduce structure by eliminating some of the higher-order terms.
Additive models assume only main effect terms: $f(X) = \alpha + \sum_{j=1}^{p} g_j(X_j)$;
second-order models will have terms with interactions of order at most
two, and so on. In Chapter 9, we describe iterative backfitting algorithms
for fitting such low-order interaction models. In the additive model, for
example, if all but the $k$th term are assumed known, then we can estimate $g_k$
by local regression of $Y - \sum_{j \ne k} g_j(X_j)$ on $X_k$. This is done for each function
in turn, repeatedly, until convergence. The important detail is that at any
stage, one-dimensional local regression is all that is needed. The same ideas
can be used to fit low-dimensional ANOVA decompositions.
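The backfitting loop just described can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the algorithm of Chapter 9: it uses a one-dimensional local linear smoother with a Gaussian kernel, runs a fixed number of sweeps instead of testing convergence, and centres each fitted function so that the intercept stays identifiable. The function names and the toy grid data are hypothetical.

```python
import math


def local_linear_smooth(targets, xs, rs, lam):
    """One-dimensional local linear fit of rs on xs, evaluated at each target."""
    fits = []
    for x0 in targets:
        s0 = s1 = s2 = t0 = t1 = 0.0
        for x, r in zip(xs, rs):
            u = x - x0
            w = math.exp(-0.5 * (u / lam) ** 2)  # Gaussian kernel weight
            s0 += w
            s1 += w * u
            s2 += w * u * u
            t0 += w * r
            t1 += w * u * r
        fits.append((s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1))
    return fits


def backfit(x_cols, y, lam, n_iter=5):
    """Fit y ~ alpha + sum_k g_k(x_k) by cycling over the coordinates."""
    n, p = len(y), len(x_cols)
    alpha = sum(y) / n
    g = [[0.0] * n for _ in range(p)]
    for _ in range(n_iter):
        for k in range(p):
            # Partial residual: remove the intercept and all other fitted terms.
            partial = [y[i] - alpha - sum(g[j][i] for j in range(p) if j != k)
                       for i in range(n)]
            fit = local_linear_smooth(x_cols[k], x_cols[k], partial, lam)
            mean_fit = sum(fit) / n
            g[k] = [f - mean_fit for f in fit]  # centre g_k
    return alpha, g


# Toy additive surface y = x1 + x2^2 on a 10 x 10 grid (no noise).
grid = [i / 9 for i in range(10)]
x1 = [a for a in grid for _ in grid]
x2 = [b for _ in grid for b in grid]
y = [a + b * b for a, b in zip(x1, x2)]
alpha, g = backfit([x1, x2], y, lam=0.15)
print(alpha, g[0][:3], g[1][:3])
```

Each sweep only ever calls the one-dimensional smoother, which is the point made in the paragraph above.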

An important special case of these structured models is the class of
varying coefficient models. Suppose, for example, that we divide the $p$
predictors in $X$ into a set $(X_1, X_2, \ldots, X_q)$ with $q < p$, and the remainder of
the variables we collect in the vector $Z$. We then assume the conditionally
linear model

$$f(X) = \alpha(Z) + \beta_1(Z) X_1 + \cdots + \beta_q(Z) X_q. \tag{6.16}$$

For given $Z$, this is a linear model, but each of the coefficients can vary
with $Z$. It is natural to fit such a model by locally weighted least squares:

$$\min_{\alpha(z_0),\,\beta(z_0)} \sum_{i=1}^{N} K_\lambda(z_0, z_i)\,\bigl(y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0)\bigr)^2. \tag{6.17}$$

FIGURE 6.10. In each panel the aorta diameter is modeled as a linear function
of age. The coefficients of this model vary with gender and depth down
the aorta (left is near the top, right is low down). There is a clear trend in the
coefficients of the linear model.

Figure 6.10 illustrates the idea on measurements of the human aorta. 
A longstanding claim has been that the aorta thickens with age. Here we 
model the diameter of the aorta as a linear function of age, but allow the 
coefficients to vary with gender and depth down the aorta. We used a local 
regression model separately for males and females. While the aorta clearly 
does thicken with age at the higher regions of the aorta, the relationship 
fades with distance down the aorta. Figure 6.11 shows the intercept and
slope as a function of depth.

[Figure 6.11 horizontal axis (both panels): Distance Down Aorta]

FIGURE 6.11. The intercept and slope of age as a function of distance down
the aorta, separately for males and females. The yellow bands indicate one
standard error.


6.5 Local Likelihood and Other Models 

The concept of local regression and varying coefficient models is extremely
broad: any parametric model can be made local if the fitting method
accommodates observation weights. Here are some examples:

• Associated with each observation $y_i$ is a parameter $\theta_i = \theta(x_i) = x_i^T\beta$
linear in the covariate(s) $x_i$, and inference for $\beta$ is based on the
log-likelihood $l(\beta) = \sum_{i=1}^{N} l(y_i, x_i^T\beta)$. We can model $\theta(X)$ more flexibly
by using the likelihood local to $x_0$ for inference of $\theta(x_0) = x_0^T\beta(x_0)$:

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l(y_i, x_i^T\beta(x_0)).$$

Many likelihood models, in particular the family of generalized linear
models including logistic and log-linear models, involve the covariates
in a linear fashion. Local likelihood allows a relaxation from a globally
linear model to one that is locally linear.

• As above, except different variables are associated with $\theta$ from those
used for defining the local likelihood:

$$l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l(y_i, \eta(x_i, \theta(z_0))).$$

For example, $\eta(x, \theta) = x^T\theta$ could be a linear model in $x$. This will fit
a varying coefficient model $\theta(z)$ by maximizing the local likelihood.

• Autoregressive time series models of order $k$ have the form
$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$. Denoting the lag set by
$z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-k})$, the model looks like a standard linear
model $y_t = z_t^T\beta + \varepsilon_t$, and is typically fit by least squares. Fitting
by local least squares with a kernel $K(z_0, z_t)$ allows the model to
vary according to the short-term history of the series. This is to be
distinguished from the more traditional dynamic linear models that
vary by windowing time.


As an illustration of local likelihood, we consider the local version of the
multiclass linear logistic regression model (4.36) of Chapter 4. The data
consist of features $x_i$ and an associated categorical response $g_i \in \{1, 2, \ldots, J\}$,
and the linear model has the form

$$\Pr(G = j \mid X = x) = \frac{e^{\beta_{j0} + \beta_j^T x}}{1 + \sum_{k=1}^{J-1} e^{\beta_{k0} + \beta_k^T x}}. \tag{6.18}$$


The local log-likelihood for this $J$ class model can be written

$$\sum_{i=1}^{N} K_\lambda(x_0, x_i)\biggl\{\beta_{g_i 0}(x_0) + \beta_{g_i}(x_0)^T (x_i - x_0)
- \log\Bigl[1 + \sum_{k=1}^{J-1} \exp\bigl(\beta_{k0}(x_0) + \beta_k(x_0)^T (x_i - x_0)\bigr)\Bigr]\biggr\}. \tag{6.19}$$


Notice that

• we have used $g_i$ as a subscript in the first line to pick out the
appropriate numerator;

• $\beta_{J0} = 0$ and $\beta_J = 0$ by the definition of the model;

• we have centered the local regressions at $x_0$, so that the fitted
posterior probabilities at $x_0$ are simply

$$\hat{\Pr}(G = j \mid X = x_0) = \frac{e^{\hat\beta_{j0}(x_0)}}{1 + \sum_{k=1}^{J-1} e^{\hat\beta_{k0}(x_0)}}. \tag{6.20}$$

This model can be used for flexible multiclass classification in moderately
low dimensions, although successes have been reported with the
high-dimensional ZIP-code classification problem. Generalized additive models
(Chapter 9) using kernel smoothing methods are closely related, and avoid
dimensionality problems by assuming an additive structure for the regression
function.

FIGURE 6.12. Each plot shows the binary response CHD (coronary heart disease)
as a function of a risk factor for the South African heart disease data.
For each plot we have computed the fitted prevalence of CHD using a local linear
logistic regression model. The unexpected increase in the prevalence of CHD at
the lower ends of the ranges is because these are retrospective data, and some of
the subjects had already undergone treatment to reduce their blood pressure and
weight. The shaded region in the plot indicates an estimated pointwise standard
error band.

As a simple illustration we fit a two-class local linear logistic model to 
the heart disease data of Chapter 4. Figure 6.12 shows the univariate local 
logistic models fit to two of the risk factors (separately). This is a useful 
screening device for detecting nonlinearities, when the data themselves have 
little visual information to offer. In this case an unexpected anomaly is 
uncovered in the data, which may have gone unnoticed with traditional 
methods. 

Since CHD is a binary indicator, we could estimate the conditional prevalence
$\Pr(G = j \mid x_0)$ by simply smoothing this binary response directly without
resorting to a likelihood formulation. This amounts to fitting a locally
constant logistic regression model (Exercise 6.5). In order to enjoy the
bias-correction of local-linear smoothing, it is more natural to operate on the
unrestricted logit scale.
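A two-class local linear logistic fit at a single target point can be sketched as follows. This is a minimal illustration, not the procedure used for Figure 6.12: it assumes a Gaussian kernel with metric bandwidth (the text later mentions a tri-cube kernel with nearest-neighbor bandwidth for that figure), plain Newton-Raphson with no step-halving or safeguards against separated data, and hypothetical names and data.

```python
import math


def local_logistic(x0, xs, ys, lam, n_iter=25):
    """Fitted Pr(y = 1 | x0) from a kernel-weighted local linear logit centred at x0."""
    w = [math.exp(-0.5 * ((x - x0) / lam) ** 2) for x in xs]
    u = [x - x0 for x in xs]
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for wi, ui, yi in zip(w, u, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * ui)))
            r = wi * (yi - p)         # weighted score contribution
            v = wi * p * (1.0 - p)    # weighted information contribution
            g0 += r
            g1 += r * ui
            h00 += v
            h01 += v * ui
            h11 += v * ui * ui
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # Newton step: solve H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    # Because the regressor is centred at x0, only the intercept is needed there.
    return 1.0 / (1.0 + math.exp(-b0))


# Illustrative binary data with a constant prevalence of one in four.
xs = [i / 199 for i in range(200)]
ys = [1 if i % 4 == 0 else 0 for i in range(200)]
print(local_logistic(0.5, xs, ys, lam=0.1))
```

Centring at the target mirrors (6.20): the fitted probability at $x_0$ depends only on the local intercept.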

Typically with logistic regression, we compute parameter estimates as
well as their standard errors. This can be done locally as well, and so
we can produce, as shown in the plot, estimated pointwise standard-error
bands about our fitted prevalence.

[Figure 6.13 horizontal axis: Systolic Blood Pressure (for CHD group)]

FIGURE 6.13. A kernel density estimate for systolic blood pressure (for the
CHD group). The density estimate at each point is the average contribution from
each of the kernels at that point. We have scaled the kernels down by a factor of
10 to make the graph readable.


6.6 Kernel Density Estimation and Classification 

Kernel density estimation is an unsupervised learning procedure, which 
historically precedes kernel regression. It also leads naturally to a simple 
family of procedures for nonparametric classification. 


6.6.1 Kernel Density Estimation 

Suppose we have a random sample $x_1, \ldots, x_N$ drawn from a probability
density $f_X(x)$, and we wish to estimate $f_X$ at a point $x_0$. For simplicity we
assume for now that $X \in \mathbb{R}$. Arguing as before, a natural local estimate
has the form

$$\hat f_X(x_0) = \frac{\#\{x_i \in \mathcal{N}(x_0)\}}{N\lambda}, \tag{6.21}$$

where $\mathcal{N}(x_0)$ is a small metric neighborhood around $x_0$ of width $\lambda$. This
estimate is bumpy, and the smooth Parzen estimate is preferred

$$\hat f_X(x_0) = \frac{1}{N\lambda}\sum_{i=1}^{N} K_\lambda(x_0, x_i), \tag{6.22}$$

because it counts observations close to $x_0$ with weights that decrease with
distance from $x_0$. In this case a popular choice for $K_\lambda$ is the Gaussian kernel
$K_\lambda(x_0, x) = \phi(|x - x_0|/\lambda)$. Figure 6.13 shows a Gaussian kernel density fit
to the sample values for systolic blood pressure for the CHD group. Letting
$\phi_\lambda$ denote the Gaussian density with mean zero and standard-deviation $\lambda$,
then (6.22) has the form

$$\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N} \phi_\lambda(x - x_i) = (\hat F \star \phi_\lambda)(x), \tag{6.23}$$

the convolution of the sample empirical distribution $\hat F$ with $\phi_\lambda$. The
distribution $\hat F(x)$ puts mass $1/N$ at each of the observed $x_i$, and is jumpy; in
$\hat f_X(x)$ we have smoothed $\hat F$ by adding independent Gaussian noise to each
observation $x_i$.

FIGURE 6.14. The left panel shows the two separate density estimates for
systolic blood pressure in the CHD versus no-CHD groups, using a Gaussian
kernel density estimate in each. The right panel shows the estimated posterior
probabilities for CHD, using (6.25).
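The Gaussian Parzen estimate (6.23) is short enough to write out directly. The sketch below is illustrative: the function name is hypothetical and the sample values are made up, not the heart disease data.

```python
import math


def gaussian_kde(x, sample, lam):
    """Parzen estimate (6.23): the average of N(x_i, lam^2) densities evaluated at x."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * lam)
    return sum(norm * math.exp(-0.5 * ((x - xi) / lam) ** 2)
               for xi in sample) / len(sample)


# Made-up sample and bandwidth, for illustration only.
sample = [118.0, 130.0, 142.0, 155.0, 170.0]
lam = 8.0

# A Riemann sum over a wide grid: the estimate should integrate to about one.
step = 0.5
total = sum(gaussian_kde(step * k, sample, lam) for k in range(601)) * step
print(gaussian_kde(140.0, sample, lam), total)
```

Each observation contributes one Gaussian bump of standard deviation `lam`, and the estimate is their average, which is why it integrates to one.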

The Parzen density estimate is the equivalent of the local average, and 
improvements have been proposed along the lines of local regression [on the 
log scale for densities; see Loader (1999)]. We will not pursue these here. 
In $\mathbb{R}^p$ the natural generalization of the Gaussian density estimate amounts
to using the Gaussian product kernel in (6.23),

$$\hat f_X(x_0) = \frac{1}{N(2\lambda^2\pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}(\|x_i - x_0\|/\lambda)^2}. \tag{6.24}$$

FIGURE 6.15. The population class densities may have interesting structure
(left) that disappears when the posterior probabilities are formed (right).


6.6.2 Kernel Density Classification 

One can use nonparametric density estimates for classification in a
straightforward fashion using Bayes’ theorem. Suppose for a $J$ class problem we fit
nonparametric density estimates $\hat f_j(X)$, $j = 1, \ldots, J$ separately in each of
the classes, and we also have estimates of the class priors $\hat\pi_j$ (usually the
sample proportions). Then

$$\hat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j \hat f_j(x_0)}{\sum_{k=1}^{J} \hat\pi_k \hat f_k(x_0)}. \tag{6.25}$$


Figure 6.14 uses this method to estimate the prevalence of CHD for the
heart risk factor study, and should be compared with the left panel of
Figure 6.12. The main difference occurs in the region of high SBP in the right
panel of Figure 6.14. In this region the data are sparse for both classes, and
since the Gaussian kernel density estimates use metric kernels, the density
estimates are low and of poor quality (high variance) in these regions. The
local logistic regression method (6.20) uses the tri-cube kernel with $k$-NN
bandwidth; this effectively widens the kernel in this region, and makes use
of the local linear assumption to smooth out the estimate (on the logit
scale).

If classification is the ultimate goal, then learning the separate class
densities well may be unnecessary, and can in fact be misleading. Figure 6.15
shows an example where the densities are both multimodal, but the
posterior ratio is quite smooth. In learning the separate densities from data,
one might decide to settle for a rougher, high-variance fit to capture these
features, which are irrelevant for the purposes of estimating the posterior
probabilities. In fact, if classification is the ultimate goal, then we need only
to estimate the posterior well near the decision boundary (for two classes,
this is the set $\{x \mid \Pr(G = 1 \mid X = x) = \frac{1}{2}\}$).
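Formula (6.25) translates directly into code. The sketch below assumes Gaussian kernel density estimates with a common bandwidth in every class and sample proportions as priors; the class labels and the values are made up for illustration and are not the study data.

```python
import math


def gaussian_kde(x, sample, lam):
    """One-dimensional Gaussian kernel density estimate at x."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * lam)
    return sum(norm * math.exp(-0.5 * ((x - xi) / lam) ** 2)
               for xi in sample) / len(sample)


def class_posteriors(x0, samples_by_class, lam):
    """Posterior class probabilities at x0 via (6.25)."""
    n_total = sum(len(s) for s in samples_by_class.values())
    # Prior (sample proportion) times class-conditional density estimate.
    joint = {c: (len(s) / n_total) * gaussian_kde(x0, s, lam)
             for c, s in samples_by_class.items()}
    z = sum(joint.values())  # may underflow far from all of the data
    return {c: v / z for c, v in joint.items()}


# Made-up one-dimensional samples for two classes.
samples = {
    "chd": [150.0, 160.0, 170.0, 180.0],
    "no_chd": [110.0, 120.0, 125.0, 130.0, 135.0, 140.0],
}
print(class_posteriors(145.0, samples, lam=10.0))
```

Far from all the training points both numerator and denominator are tiny, which is the numerical face of the poor-quality estimates in sparse regions noted above.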


6.6.3 The Naive Bayes Classifier 

This is a technique that has remained popular over the years, despite its
name (also known as “Idiot’s Bayes”!). It is especially appropriate when
the dimension $p$ of the feature space is high, making density estimation
unattractive. The naive Bayes model assumes that given a class $G = j$, the
features $X_k$ are independent:

$$f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k). \tag{6.26}$$

While this assumption is generally not true, it does simplify the estimation
dramatically:

• The individual class-conditional marginal densities $f_{jk}$ can each be
estimated separately using one-dimensional kernel density estimates.
This is in fact a generalization of the original naive Bayes procedures,
which used univariate Gaussians to represent these marginals.

• If a component $X_j$ of $X$ is discrete, then an appropriate histogram
estimate can be used. This provides a seamless way of mixing variable
types in a feature vector.

Despite these rather optimistic assumptions, naive Bayes classifiers often 
outperform far more sophisticated alternatives. The reasons are related to 
Figure 6.15: although the individual class density estimates may be biased, 
this bias might not hurt the posterior probabilities as much, especially 
near the decision regions. In fact, the problem may be able to withstand 
considerable bias for the savings in variance such a “naive” assumption 
earns. 
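A naive Bayes classifier built from one-dimensional kernel density estimates, as in (6.26), might look like the sketch below. It assumes continuous features, Gaussian kernels with one shared bandwidth, and sample-proportion priors; the names and the tiny data set are hypothetical. The computation is done on the log scale, and it will fail if a marginal density estimate underflows to zero.

```python
import math


def kde_1d(x, sample, lam):
    """One-dimensional Gaussian kernel density estimate at x."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * lam)
    return sum(norm * math.exp(-0.5 * ((x - xi) / lam) ** 2)
               for xi in sample) / len(sample)


def naive_bayes_posteriors(x, rows_by_class, lam):
    """Posterior class probabilities at feature vector x under (6.26)."""
    n_total = sum(len(rows) for rows in rows_by_class.values())
    log_joint = {}
    for c, rows in rows_by_class.items():
        lp = math.log(len(rows) / n_total)  # log prior
        for k in range(len(x)):
            column = [row[k] for row in rows]
            lp += math.log(kde_1d(x[k], column, lam))  # log marginal density
        log_joint[c] = lp
    # Normalise with the max subtracted for numerical stability.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / z for c, v in log_joint.items()}


# Made-up two-dimensional training rows for two classes.
rows = {
    "a": [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5), (0.3, 0.2)],
    "b": [(3.0, 3.0), (2.5, 3.5), (3.5, 2.5), (3.2, 2.8)],
}
print(naive_bayes_posteriors((1.5, 1.5), rows, lam=0.5))
```

Only $p$ one-dimensional density estimates per class are needed, regardless of how the features are jointly distributed.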

Starting from (6.26) we can derive the logit-transform (using class $J$ as
the base):

$$\begin{aligned}
\log\frac{\Pr(G = \ell \mid X)}{\Pr(G = J \mid X)}
&= \log\frac{\pi_\ell f_\ell(X)}{\pi_J f_J(X)} \\
&= \log\frac{\pi_\ell \prod_{k=1}^{p} f_{\ell k}(X_k)}{\pi_J \prod_{k=1}^{p} f_{Jk}(X_k)} \\
&= \log\frac{\pi_\ell}{\pi_J} + \sum_{k=1}^{p} \log\frac{f_{\ell k}(X_k)}{f_{Jk}(X_k)} \\
&= \alpha_\ell + \sum_{k=1}^{p} g_{\ell k}(X_k).
\end{aligned} \tag{6.27}$$

This has the form of a generalized additive model, which is described in more
detail in Chapter 9. The models are fit in quite different ways though; their
differences are explored in Exercise 6.9. The relationship between naive
Bayes and generalized additive models is analogous to that between linear
discriminant analysis and logistic regression (Section 4.4.5).


6.7 Radial Basis Functions and Kernels


In Chapter 5, functions are represented as expansions in basis functions:
$f(x) = \sum_{j=1}^{M} \beta_j h_j(x)$. The art of flexible modeling using basis expansions
consists of picking an appropriate family of basis functions, and then
controlling the complexity of the representation by selection, regularization, or
both. Some of the families of basis functions have elements that are defined
locally; for example, B-splines are defined locally in $\mathbb{R}$. If more flexibility
is desired in a particular region, then that region needs to be represented
by more basis functions (which in the case of B-splines translates to more
knots). Tensor products of $\mathbb{R}$-local basis functions deliver basis functions
local in $\mathbb{R}^p$. Not all basis functions are local; for example, the truncated
power bases for splines, or the sigmoidal basis functions $\sigma(\alpha_0 + \alpha x)$ used
in neural-networks (see Chapter 11). The composed function $f(x)$ can
nevertheless show local behavior, because of the particular signs and values
of the coefficients causing cancellations of global effects. For example, the
truncated power basis has an equivalent B-spline basis for the same space
of functions; the cancellation is exact in this case.

Kernel methods achieve flexibility by fitting simple models in a region
local to the target point $x_0$. Localization is achieved via a weighting kernel
$K_\lambda$, and individual observations receive weights $K_\lambda(x_0, x_i)$.

Radial basis functions combine these ideas, by treating the kernel functions
$K_\lambda(\xi, x)$ as basis functions. This leads to the model

$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\,\beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right)\beta_j, \tag{6.28}$$

where each basis element is indexed by a location or prototype parameter $\xi_j$
and a scale parameter $\lambda_j$. A popular choice for $D$ is the standard Gaussian
density function. There are several approaches to learning the parameters
$\{\lambda_j, \xi_j, \beta_j\}$, $j = 1, \ldots, M$. For simplicity we will focus on least squares
methods for regression, and use the Gaussian kernel.

• Optimize the sum-of-squares with respect to all the parameters:

$$\min_{\{\lambda_j,\,\xi_j,\,\beta_j\}_1^M} \sum_{i=1}^{N} \Biggl(y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\Bigl\{-\frac{(x_i - \xi_j)^T (x_i - \xi_j)}{\lambda_j^2}\Bigr\}\Biggr)^2. \tag{6.29}$$

This model is commonly referred to as an RBF network, an alternative
to the sigmoidal neural network discussed in Chapter 11; the $\xi_j$
and $\lambda_j$ playing the role of the weights. This criterion is nonconvex
with multiple local minima, and the algorithms for optimization are
similar to those used for neural networks.

• Estimate the $\{\lambda_j, \xi_j\}$ separately from the $\beta_j$. Given the former, the
estimation of the latter is a simple least squares problem. Often the
kernel parameters $\lambda_j$ and $\xi_j$ are chosen in an unsupervised way using
the $X$ distribution alone. One of the methods is to fit a Gaussian
mixture density model to the training $x_i$, which provides both the
centers $\xi_j$ and the scales $\lambda_j$. Other even more ad hoc approaches use
clustering methods to locate the prototypes $\xi_j$, and treat $\lambda_j = \lambda$
as a hyper-parameter. The obvious drawback of these approaches is
that the conditional distribution $\Pr(Y|X)$ and in particular $E(Y|X)$
have no say in where the action is concentrated. On the positive
side, they are much simpler to implement.

FIGURE 6.16. Gaussian radial basis functions in $\mathbb{R}$ with fixed width can leave
holes (top panel). Renormalized Gaussian radial basis functions avoid this
problem, and produce basis functions similar in some respects to B-splines.


While it would seem attractive to reduce the parameter set and assume
a constant value for $\lambda_j = \lambda$, this can have an undesirable side effect of
creating holes: regions of $\mathbb{R}^p$ where none of the kernels has appreciable
support, as illustrated in Figure 6.16 (upper panel). Renormalized radial
basis functions,

$$h_j(x) = \frac{D(\|x - \xi_j\|/\lambda)}{\sum_{k=1}^{M} D(\|x - \xi_k\|/\lambda)}, \tag{6.30}$$

avoid this problem (lower panel).
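The effect of the renormalization in (6.30) is easy to see numerically. The following sketch works in one dimension with a standard Gaussian density for $D$; the prototype locations and the width are made up, and the function names are illustrative.

```python
import math


def gaussian_profile(t):
    """Standard Gaussian density, used here as the profile D."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def raw_basis(x, centers, lam):
    """Fixed-width radial basis functions D(|x - xi_j| / lam)."""
    return [gaussian_profile(abs(x - c) / lam) for c in centers]


def renormalized_basis(x, centers, lam):
    """Renormalized basis functions h_j(x) of (6.30); they sum to one at every x."""
    raw = raw_basis(x, centers, lam)
    total = sum(raw)  # can underflow to zero very far from every centre
    return [r / total for r in raw]


centers = [0.0, 4.0, 8.0]  # made-up prototypes, widely spaced relative to lam
lam = 0.5
x_hole = 2.0               # midway between the first two prototypes

print(sum(raw_basis(x_hole, centers, lam)))    # tiny: a hole in the raw basis
print(renormalized_basis(x_hole, centers, lam))
```

Midway between two prototypes the raw basis functions are all close to zero, while the renormalized ones split the weight between the two nearest prototypes.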

The Nadaraya-Watson kernel regression estimator (6.2) in $\mathbb{R}^p$ can be
viewed as an expansion in renormalized radial basis functions,

$$\hat f(x_0) = \sum_{i=1}^{N} y_i \frac{K_\lambda(x_0, x_i)}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)} = \sum_{i=1}^{N} y_i h_i(x_0),$$   (6.31)

with a basis function $h_i$ located at every observation and coefficients $y_i$;
that is, $\xi_i = x_i$, $\hat\beta_i = y_i$, $i = 1, \ldots, N$.
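Read this way, the Nadaraya-Watson fit is an inner product between the responses and a set of renormalized basis values. A small sketch, assuming a Gaussian kernel and four made-up observations (the function names are ours):

```python
import numpy as np

def nw_basis(x0, x, lam):
    """Renormalized kernel weights h_i(x0), one per observation; they sum to one."""
    k = np.exp(-0.5 * ((x - x0) / lam) ** 2)
    return k / k.sum()

def nadaraya_watson(x0, x, y, lam):
    """Expansion sum_i y_i h_i(x0): the coefficients are the responses themselves."""
    return float(nw_basis(x0, x, lam) @ y)

# Four made-up observations, used only for illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

h = nw_basis(1.5, x, lam=1.0)
fit = nadaraya_watson(1.5, x, y, lam=1.0)
```

Since the weights are nonnegative and sum to one, the fit at any point is a weighted average of the observed responses.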

Note the similarity between the expansion (6.31) and the solution (5.50) 
on page 169 to the regularization problem induced by the kernel K. Radial 
basis functions form the bridge between the modern “kernel methods” and 
local fitting technology. 


6.8 Mixture Models for Density Estimation and 
Classification 

The mixture model is a useful tool for density estimation, and can be viewed
as a kind of kernel method. The Gaussian mixture model has the form

$$f(x) = \sum_{m=1}^{M} \alpha_m \phi(x; \mu_m, \Sigma_m)$$   (6.32)

with mixing proportions $\alpha_m$, $\sum_m \alpha_m = 1$, and each Gaussian density has
a mean $\mu_m$ and covariance matrix $\Sigma_m$. In general, mixture models can use
any component densities in place of the Gaussian in (6.32): the Gaussian
mixture model is by far the most popular.

The parameters are usually fit by maximum likelihood, using the EM 
algorithm as described in Chapter 8. Some special cases arise: 

• If the covariance matrices are constrained to be scalar: $\Sigma_m = \sigma_m I$,
then (6.32) has the form of a radial basis expansion.

• If in addition $\sigma_m = \sigma > 0$ is fixed, and $M \uparrow N$, then the maximum
likelihood estimate for (6.32) approaches the kernel density
estimate (6.22) where $\hat\alpha_m = 1/N$ and $\hat\mu_m = x_m$.

Using Bayes’ theorem, separate mixture densities in each class lead to flexible
models for $\Pr(G|X)$; this is taken up in some detail in Chapter 12.

Figure 6.17 shows an application of mixtures to the heart disease risk- 
factor study. In the top row are histograms of Age for the no CHD and CHD 
groups separately, and then combined on the right. Using the combined 
data, we fit a two-component mixture of the form (6.32) with the (scalars)
$\Sigma_1$ and $\Sigma_2$ not constrained to be equal. Fitting was done via the EM
algorithm (Chapter 8): note that the procedure does not use knowledge of
the CHD labels. The resulting estimates were

$\hat\mu_1 = 36.4$, $\hat\Sigma_1 = 157.7$, $\hat\alpha_1 = 0.7$,

$\hat\mu_2 = 58.0$, $\hat\Sigma_2 = 15.6$, $\hat\alpha_2 = 0.3$.

The component densities $\phi(\hat\mu_1, \hat\Sigma_1)$ and $\phi(\hat\mu_2, \hat\Sigma_2)$ are shown in the lower-left
and middle panels. The lower-right panel shows these component densities
(orange and blue) along with the estimated mixture density (green).
FIGURE 6.17. Application of mixtures to the heart disease risk-factor study.
(Top row:) Histograms of Age for the no CHD and CHD groups separately, and
combined. (Bottom row:) Estimated component densities from a Gaussian mixture
model (bottom left, bottom middle); (bottom right:) estimated component
densities (blue and orange) along with the estimated mixture density (green). The
orange density has a very large standard deviation, and approximates a uniform
density.


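The mixture model also provides an estimate of the probability that
observation $i$ belongs to component $m$,

$$\hat r_{im} = \frac{\hat\alpha_m \phi(x_i; \hat\mu_m, \hat\Sigma_m)}{\sum_{k=1}^{M} \hat\alpha_k \phi(x_i; \hat\mu_k, \hat\Sigma_k)},$$   (6.33)

where $x_i$ is Age in our example. Suppose we threshold each value $\hat r_{i2}$ and
hence define $\hat\delta_i = I(\hat r_{i2} > 0.5)$. Then we can compare the classification of
each observation by CHD and the mixture model: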



                    Mixture model
              $\hat\delta = 0$    $\hat\delta = 1$
CHD    No         232          70
       Yes         76          84


Although the mixture model did not use the CHD labels, it has done a fair 
job in discovering the two CHD subpopulations. Linear logistic regression, 
using the CHD as a response, achieves the same error rate (32%) when fit to 
these data using maximum-likelihood (Section 4.4). 
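The fitting and thresholding steps of this example can be sketched as follows. The heart disease data are not reproduced here, so the code runs on simulated ages drawn from two Gaussians; the sample sizes, the starting values, and the function names are all assumptions made for illustration, and the EM updates are the standard ones for a two-component univariate Gaussian mixture.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture with unequal variances."""
    mu = np.quantile(x, [0.25, 0.75])          # crude starting values
    var = np.array([x.var(), x.var()])
    alpha = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities as in (6.33)
        dens = alpha * normal_pdf(x[:, None], mu, var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted proportions, means and variances
        nk = r.sum(axis=0)
        alpha = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    dens = alpha * normal_pdf(x[:, None], mu, var)
    r = dens / dens.sum(axis=1, keepdims=True)
    return alpha, mu, var, r

# Simulated stand-in for Age (not the study data)
rng = np.random.default_rng(0)
age = np.concatenate([rng.normal(35, 12, 300), rng.normal(58, 4, 150)])

alpha, mu, var, r = em_two_gaussians(age)
delta = (r[:, 1] > 0.5).astype(int)   # thresholded membership in component 2
```

Cross-tabulating `delta` against known group labels would give a table of the same form as the one above; note that the labels play no role in the fit.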
6.9 Computational Considerations

Kernel and local regression and density estimation are memory-based methods:
the model is the entire training data set, and the fitting is done at
evaluation or prediction time. For many real-time applications, this can
make this class of methods infeasible.

The computational cost to fit at a single observation $x_0$ is $O(N)$ flops,
except in oversimplified cases (such as square kernels). By comparison,
an expansion in $M$ basis functions costs $O(M)$ for one evaluation, and
typically $M \sim O(\log N)$. Basis function methods have an initial cost of at
least $O(NM^2 + M^3)$.

The smoothing parameter(s) $\lambda$ for kernel methods are typically determined
off-line, for example using cross-validation, at a cost of $O(N^2)$ flops.

Popular implementations of local regression, such as the loess function in
S-PLUS and R and the locfit procedure (Loader, 1999), use triangulation
schemes to reduce the computations. They compute the fit exactly at $M$
carefully chosen locations ($O(NM)$), and then use blending techniques to
interpolate the fit elsewhere ($O(M)$ per evaluation).


Bibliographic Notes 

There is a vast literature on kernel methods which we will not attempt to 
summarize. Rather we will point to a few good references that themselves 
have extensive bibliographies. Loader (1999) gives excellent coverage of local
regression and likelihood, and also describes state-of-the-art software
for fitting these models. Fan and Gijbels (1996) cover these models from
a more theoretical aspect. Hastie and Tibshirani (1990) discuss local regression
in the context of additive modeling. Silverman (1986) gives a good
overview of density estimation, as does Scott (1992). 


Exercises 


Ex. 6.1 Show that the Nadaraya-Watson kernel smooth with fixed metric
bandwidth $\lambda$ and a Gaussian kernel is differentiable. What can be said for
the Epanechnikov kernel? What can be said for the Epanechnikov kernel
with adaptive nearest-neighbor bandwidth $\lambda(x_0)$?

Ex. 6.2 Show that $\sum_{i=1}^{N} (x_i - x_0) l_i(x_0) = 0$ for local linear regression. Define
$b_j(x_0) = \sum_{i=1}^{N} (x_i - x_0)^j l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial
regression of any degree (including local constants). Show that $b_j(x_0) = 0$
for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What
are the implications of this on the bias?
Ex. 6.3 Show that $\|l(x)\|$ (Section 6.1.2) increases with the degree of the
local polynomial.

Ex. 6.4 Suppose that the $p$ predictors $X$ arise from sampling relatively
smooth analog curves at $p$ uniformly spaced abscissa values. Denote by
$\mathrm{Cov}(X|Y) = \Sigma$ the conditional covariance matrix of the predictors, and
assume this does not change much with $Y$. Discuss the nature of Mahalanobis
choice $A = \Sigma^{-1}$ for the metric in (6.14). How does this compare
with $A = I$? How might you construct a kernel $A$ that (a) downweights
high-frequency components in the distance metric; (b) ignores them
completely?

Ex. 6.5 Show that fitting a locally constant multinomial logit model of
the form (6.19) amounts to smoothing the binary response indicators for
each class separately using a Nadaraya-Watson kernel smoother with kernel
weights $K_\lambda(x_0, x_i)$.

Ex. 6.6 Suppose that all you have is software for fitting local regression, 
but you can specify exactly which monomials are included in the fit. How 
could you use this software to fit a varying-coefficient model in some of the 
variables? 

Ex. 6.7 Derive an expression for the leave-one-out cross-validated residual 
sum-of-squares for local polynomial regression. 

Ex. 6.8 Suppose that for continuous response $Y$ and predictor $X$, we model
the joint density of $X, Y$ using a multivariate Gaussian kernel estimator.
Note that the kernel in this case would be the product kernel $\phi_\lambda(X)\phi_\lambda(Y)$.
Show that the conditional mean $E(Y|X)$ derived from this estimate is a
Nadaraya-Watson estimator. Extend this result to classification by providing
a suitable kernel for the estimation of the joint distribution of a
continuous $X$ and discrete $Y$.

Ex. 6.9 Explore the differences between the naive Bayes model (6.27) and
a generalized additive logistic regression model, in terms of (a) model
assumptions and (b) estimation. If all the variables $X_k$ are discrete, what can
you say about the corresponding GAM?

Ex. 6.10 Suppose we have $N$ samples generated from the model $y_i = f(x_i) + \varepsilon_i$,
with $\varepsilon_i$ independent and identically distributed with mean zero and
variance $\sigma^2$, the $x_i$ assumed fixed (non random). We estimate $f$ using a
linear smoother (local regression, smoothing spline, etc.) with smoothing
parameter $\lambda$. Thus the vector of fitted values is given by $\hat{\mathbf{f}} = \mathbf{S}_\lambda \mathbf{y}$. Consider
the in-sample prediction error

$$\mathrm{PE}(\lambda) = E\,\frac{1}{N} \sum_{i=1}^{N} (y_i^* - \hat f_\lambda(x_i))^2$$   (6.34)

for predicting new responses at the $N$ input values. Show that the average
squared residual on the training data, $\mathrm{ASR}(\lambda)$, is a biased estimate
(optimistic) for $\mathrm{PE}(\lambda)$, while

$$C_\lambda = \mathrm{ASR}(\lambda) + \frac{2\sigma^2}{N}\,\mathrm{trace}(\mathbf{S}_\lambda)$$   (6.35)

is unbiased.


Ex. 6.11 Show that for the Gaussian mixture model (6.32) the likelihood
is maximized at $+\infty$, and describe how.


Ex. 6.12 Write a computer program to perform a local linear discriminant
analysis. At each query point $x_0$, the training data receive weights
$K_\lambda(x_0, x_i)$ from a weighting kernel, and the ingredients for the linear decision
boundaries (see Section 4.3) are computed by weighted averages. Try
out your program on the zipcode data, and show the training and test errors
for a series of five pre-chosen values of $\lambda$. The zipcode data are available
from the book website www-stat.stanford.edu/ElemStatLearn.


Model Assessment and Selection


7.1 Introduction 


The generalization performance of a learning method relates to its prediction
capability on independent test data. Assessment of this performance
is extremely important in practice, since it guides the choice of learning
method or model, and gives us a measure of the quality of the ultimately
chosen model.

In this chapter we describe and illustrate the key methods for performance
assessment, and show how they are used to select models. We begin
the chapter with a discussion of the interplay between bias, variance and 
model complexity. 

7.2 Bias, Variance and Model Complexity 

Figure 7.1 illustrates the important issue in assessing the ability of a learning
method to generalize. Consider first the case of a quantitative or interval
scale response. We have a target variable $Y$, a vector of inputs $X$, and a
prediction model $\hat f(X)$ that has been estimated from a training set $\mathcal{T}$.
The loss function for measuring errors between $Y$ and $\hat f(X)$ is denoted by
$L(Y, \hat f(X))$. Typical choices are

$$L(Y, \hat f(X)) = \begin{cases} (Y - \hat f(X))^2 & \text{squared error} \\ |Y - \hat f(X)| & \text{absolute error.} \end{cases}$$   (7.1)

FIGURE 7.1. Behavior of test sample and training sample error as the model
complexity is varied. The light blue curves show the training error $\overline{\mathrm{err}}$, while the
light red curves show the conditional test error $\mathrm{Err}_{\mathcal{T}}$ for 100 training sets of size
50 each, as the model complexity is increased. The solid curves show the expected
test error Err and the expected training error $E[\overline{\mathrm{err}}]$.

Test error, also referred to as generalization error, is the prediction error
over an independent test sample

$$\mathrm{Err}_{\mathcal{T}} = E[L(Y, \hat f(X)) \mid \mathcal{T}]$$   (7.2)

where both $X$ and $Y$ are drawn randomly from their joint distribution
(population). Here the training set $\mathcal{T}$ is fixed, and test error refers to the
error for this specific training set. A related quantity is the expected prediction
error (or expected test error)

$$\mathrm{Err} = E[L(Y, \hat f(X))] = E[\mathrm{Err}_{\mathcal{T}}].$$   (7.3)

Note that this expectation averages over everything that is random, including
the randomness in the training set that produced $\hat f$.

Figure 7.1 shows the prediction error (light red curves) $\mathrm{Err}_{\mathcal{T}}$ for 100
simulated training sets each of size 50. The lasso (Section 3.4.2) was used
to produce the sequence of fits. The solid red curve is the average, and
hence an estimate of Err.

Estimation of $\mathrm{Err}_{\mathcal{T}}$ will be our goal, although we will see that Err is
more amenable to statistical analysis, and most methods effectively estimate
the expected error. It does not seem possible to estimate conditional
error effectively, given only the information in the same training set. Some
discussion of this point is given in Section 7.12.

Training error is the average loss over the training sample

$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i)).$$   (7.4)

We would like to know the expected test error of our estimated model
$\hat f$. As the model becomes more and more complex, it uses the training
data more and is able to adapt to more complicated underlying structures.
Hence there is a decrease in bias but an increase in variance. There is some
intermediate model complexity that gives minimum expected test error.

Unfortunately training error is not a good estimate of the test error, 
as seen in Figure 7.1. Training error consistently decreases with model 
complexity, typically dropping to zero if we increase the model complexity 
enough. However, a model with zero training error is overfit to the training 
data and will typically generalize poorly. 
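The qualitative behavior of the two error curves can be reproduced with a small simulation. The sketch below is not the lasso experiment behind Figure 7.1: it assumes a made-up regression function, uses polynomial degree as the complexity measure, and evaluates test error on a large independent sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(3 * x)                  # assumed true regression function

n_train, sigma = 30, 0.3
x_tr = rng.uniform(-1, 1, n_train)
y_tr = f(x_tr) + sigma * rng.normal(size=n_train)
x_te = rng.uniform(-1, 1, 5000)           # large independent test sample
y_te = f(x_te) + sigma * rng.normal(size=5000)

train_err, test_err = [], []
for degree in range(0, 9):                # model complexity = polynomial degree
    coef = np.polyfit(x_tr, y_tr, degree)
    train_err.append(np.mean((y_tr - np.polyval(coef, x_tr)) ** 2))
    test_err.append(np.mean((y_te - np.polyval(coef, x_te)) ** 2))
```

Because the polynomial models are nested and fit by least squares, `train_err` cannot increase with the degree, while `test_err` is free to turn upward once the fit starts chasing noise.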

The story is similar for a qualitative or categorical response $G$ taking
one of $K$ values in a set $\mathcal{G}$, labeled for convenience as $1, 2, \ldots, K$. Typically
we model the probabilities $p_k(X) = \Pr(G = k|X)$ (or some monotone
transformations $f_k(X)$), and then $\hat G(X) = \arg\max_k \hat p_k(X)$. In some cases,
such as 1-nearest neighbor classification (Chapters 2 and 13) we produce
$\hat G(X)$ directly. Typical loss functions are

$$L(G, \hat G(X)) = I(G \ne \hat G(X))$$   (0-1 loss),   (7.5)

$$L(G, \hat p(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat p_k(X) = -2 \log \hat p_G(X)$$   ($-2 \times$ log-likelihood).   (7.6)

The quantity $-2 \times$ the log-likelihood is sometimes referred to as the
deviance.
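Both losses are simple to evaluate once class probabilities are available. In the sketch below the classes are coded $0, \ldots, K-1$ rather than $1, \ldots, K$ (a convenience of array indexing), and the probability matrix is made up for illustration.

```python
import numpy as np

def zero_one_loss(g, g_hat):
    """0-1 loss (7.5): one for each misclassified observation, zero otherwise."""
    return (g != g_hat).astype(float)

def deviance_loss(g, p_hat):
    """Deviance (7.6): -2 log of the fitted probability of the observed class."""
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

# Made-up fitted probabilities for two observations and K = 3 classes
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
g = np.array([0, 2])                 # observed classes
g_hat = p_hat.argmax(axis=1)         # classify to the most probable class

loss01 = zero_one_loss(g, g_hat)
dev = deviance_loss(g, p_hat)
```

The second observation is misclassified, so its 0-1 loss is one; its deviance, $-2\log 0.3$, additionally records how little probability the model gave to the observed class.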

Again, test error here is $\mathrm{Err}_{\mathcal{T}} = E[L(G, \hat G(X)) \mid \mathcal{T}]$, the population
misclassification error of the classifier trained on $\mathcal{T}$, and Err is the expected
misclassification error.

Training error is the sample analogue, for example,

$$\overline{\mathrm{err}} = -\frac{2}{N} \sum_{i=1}^{N} \log \hat p_{g_i}(x_i),$$   (7.7)

the sample log-likelihood for the model.

The log-likelihood can be used as a loss-function for general response
densities, such as the Poisson, gamma, exponential, log-normal and others.
If $\Pr_{\theta(X)}(Y)$ is the density of $Y$, indexed by a parameter $\theta(X)$ that depends
on the predictor $X$, then

$$L(Y, \theta(X)) = -2 \cdot \log \Pr{}_{\theta(X)}(Y).$$   (7.8)

The “$-2$” in the definition makes the log-likelihood loss for the Gaussian
distribution match squared-error loss.

For ease of exposition, for the remainder of this chapter we will use $Y$ and
$f(X)$ to represent all of the above situations, since we focus mainly on the
quantitative response (squared-error loss) setting. For the other situations,
the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the
expected test error for a model. Typically our model will have a tuning
parameter or parameters $\alpha$ and so we can write our predictions as $\hat f_\alpha(x)$.
The tuning parameter varies the complexity of our model, and we wish to
find the value of $\alpha$ that minimizes error, that is, produces the minimum of
the average test error curve in Figure 7.1. Having said this, for brevity we
will often suppress the dependence of $\hat f(x)$ on $\alpha$.

It is important to note that there are in fact two separate goals that we 
might have in mind: 

Model selection: estimating the performance of different models in order 
to choose the best one. 

Model assessment: having chosen a final model, estimating its prediction
error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is 
to randomly divide the dataset into three parts: a training set, a validation 
set, and a test set. The training set is used to fit the models; the validation 
set is used to estimate prediction error for model selection; the test set is 
used for assessment of the generalization error of the final chosen model. 
Ideally, the test set should be kept in a “vault,” and be brought out only 
at the end of the data analysis. Suppose instead that we use the test-set 
repeatedly, choosing the model with smallest test-set error. Then the test 
set error of the final chosen model will underestimate the true test error, 
sometimes substantially. 

It is difficult to give a general rule on how to choose the number of 
observations in each of the three parts, as this depends on the signal-to- 
noise ratio in the data and the training sample size. A typical split might 
be 50% for training, and 25% each for validation and testing:

[ Train (50%) | Validation (25%) | Test (25%) ]
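A random three-way division of this kind can be sketched as follows. The function name `three_way_split`, the default fractions, and the seed are our own choices for illustration; the split is done on row indices so that it can be applied to any dataset.

```python
import numpy as np

def three_way_split(n, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition the indices 0..n-1 into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # Whatever remains after the first two parts becomes the test set
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(200)
```

The test indices should then be set aside until the final model has been chosen using only the training and validation parts.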


The methods in this chapter are designed for situations where there is 
insufficient data to split it into three parts. Again it is too difficult to give 
a general rule on how much training data is enough; among other things, 
this depends on the signal-to-noise ratio of the underlying function, and 
the complexity of the models being fit to the data. 
The methods of this chapter approximate the validation step either
analytically (AIC, BIC, MDL, SRM) or by efficient sample re-use
(cross-validation and the bootstrap). Besides their use in model selection, we also
examine to what extent each method provides a reliable estimate of test
error of the final chosen model.

Before jumping into these topics, we first explore in more detail the 
nature of test error and the bias-variance tradeoff. 


7.3 The Bias-Variance Decomposition 


As in Chapter 2, if we assume that $Y = f(X) + \varepsilon$ where $E(\varepsilon) = 0$ and
$\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, we can derive an expression for the expected prediction error
of a regression fit $\hat f(X)$ at an input point $X = x_0$, using squared-error loss:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}$$   (7.9)

The first term is the variance of the target around its true mean $f(x_0)$, and
cannot be avoided no matter how well we estimate $f(x_0)$, unless $\sigma_\varepsilon^2 = 0$.
The second term is the squared bias, the amount by which the average of
our estimate differs from the true mean; the last term is the variance, the
expected squared deviation of $\hat f(x_0)$ around its mean. Typically the more
complex we make the model $\hat f$, the lower the (squared) bias but the higher
the variance.

For the $k$-nearest-neighbor regression fit, these expressions have the simple
form

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f_k(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k} \sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}.
\end{aligned}$$   (7.10)

Here we assume for simplicity that training inputs $x_i$ are fixed, and the randomness
arises from the $y_i$. The number of neighbors $k$ is inversely related
to the model complexity. For small $k$, the estimate $\hat f_k(x)$ can potentially
adapt itself better to the underlying $f(x)$. As we increase $k$, the bias (the
squared difference between $f(x_0)$ and the average of $f(x)$ at the $k$-nearest
neighbors) will typically increase, while the variance decreases.
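Expression (7.10) can be checked by Monte Carlo. The sketch below assumes a made-up regression function, fixed training inputs on a grid, Gaussian noise, and a single query point; only the training responses and the new response are redrawn in each replication, matching the fixed-inputs assumption above.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)          # assumed true regression function

x_train = np.linspace(0, 1, 50)           # fixed training inputs
x0, sigma, k = 0.3, 0.5, 5
nearest = np.argsort(np.abs(x_train - x0))[:k]   # indices of the k nearest inputs

# Terms of (7.10)
bias2 = (f(x0) - f(x_train[nearest]).mean()) ** 2
variance = sigma ** 2 / k
err_theory = sigma ** 2 + bias2 + variance

# Monte Carlo: redraw the neighbors' responses and a new response at x0
n_rep = 200_000
y_neighbors = f(x_train[nearest]) + sigma * rng.normal(size=(n_rep, k))
f_hat = y_neighbors.mean(axis=1)          # k-nearest-neighbor fit at x0
y0 = f(x0) + sigma * rng.normal(size=n_rep)
err_mc = np.mean((y0 - f_hat) ** 2)
```

Rerunning with a larger `k` shrinks `variance` but typically enlarges `bias2`, which is the tradeoff described in the text.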

For a linear model fit $\hat f_p(x) = x^T \hat\beta$, where the parameter vector $\beta$ with
$p$ components is fit by least squares, we have

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f_p(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|\mathbf{h}(x_0)\|^2 \sigma_\varepsilon^2.
\end{aligned}$$   (7.11)

Here $\mathbf{h}(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} x_0$, the $N$-vector of linear weights that produce
the fit $\hat f_p(x_0) = x_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, and hence $\mathrm{Var}[\hat f_p(x_0)] = \|\mathbf{h}(x_0)\|^2 \sigma_\varepsilon^2$.
While this variance changes with $x_0$, its average (with $x_0$ taken to be each
of the sample values $x_i$) is $(p/N)\sigma_\varepsilon^2$, and hence

$$\frac{1}{N} \sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N} \sum_{i=1}^{N} [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N} \sigma_\varepsilon^2,$$   (7.12)

the in-sample error. Here model complexity is directly related to the number
of parameters $p$.
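The claim that the average of $\|\mathbf{h}(x_i)\|^2$ over the sample points is $p/N$ follows because the hat matrix is a symmetric projection with trace $p$. A short numerical check, using a randomly generated design matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.normal(size=(N, p))               # made-up design matrix of full column rank

# Column i of this matrix is h(x_i) = X (X^T X)^{-1} x_i
H_weights = X @ np.linalg.solve(X.T @ X, X.T)

sq_norms = np.sum(H_weights ** 2, axis=0)   # ||h(x_i)||^2 for each sample point
avg_sq_norm = sq_norms.mean()               # equals trace(H) / N = p / N
```

Multiplying `avg_sq_norm` by the noise variance gives the variance term $(p/N)\sigma_\varepsilon^2$ in (7.12).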

The test error $\mathrm{Err}(x_0)$ for a ridge regression fit $\hat f_\alpha(x_0)$ is identical in
form to (7.11), except the linear weights in the variance term are different:
$\mathbf{h}(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1} x_0$. The bias term will also be different.

For a linear model family such as ridge regression, we can break down
the bias more finely. Let $\beta_*$ denote the parameters of the best-fitting linear
approximation to $f$:

$$\beta_* = \arg\min_\beta E\,(f(X) - X^T\beta)^2.$$   (7.13)

Here the expectation is taken with respect to the distribution of the input
variables $X$. Then we can write the average squared bias as

$$\begin{aligned}
E_{x_0}\big[f(x_0) - E\hat f_\alpha(x_0)\big]^2 &= E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E x_0^T\hat\beta_\alpha\big]^2 \\
&= \mathrm{Ave}[\text{Model Bias}]^2 + \mathrm{Ave}[\text{Estimation Bias}]^2.
\end{aligned}$$   (7.14)

The first term on the right-hand side is the average squared model bias, the
error between the best-fitting linear approximation and the true function.
The second term is the average squared estimation bias, the error between
the average estimate $E(x_0^T\hat\beta_\alpha)$ and the best-fitting linear approximation.

For linear models fit by ordinary least squares, the estimation bias is zero. 
For restricted fits, such as ridge regression, it is positive, and we trade it off 
with the benefits of a reduced variance. The model bias can only be reduced 
by enlarging the class of linear models to a richer collection of models, by 
including interactions and transformations of the variables in the model. 

Figure 7.2 shows the bias-variance tradeoff schematically. In the case
of linear models, the model space is the set of all linear predictions from
$p$ inputs and the black dot labeled “closest fit” is $x^T\beta_*$. The blue-shaded
region indicates the error $\sigma_\varepsilon$ with which we see the truth in the training
sample.

Also shown is the variance of the least squares fit, indicated by the large
yellow circle centered at the black dot labeled “closest fit in population.”
Now if we were to fit a model with fewer predictors, or regularize the
coefficients by shrinking them toward zero (say), we would get the “shrunken
fit” shown in the figure. This fit has an additional estimation bias, due to
the fact that it is not the closest fit in the model space. On the other hand,
it has smaller variance. If the decrease in variance exceeds the increase in
(squared) bias, then this is worthwhile.

FIGURE 7.2. Schematic of the behavior of bias and variance. The model space
is the set of all possible predictions from the model, with the “closest fit” labeled
with a black dot. The model bias from the truth is shown, along with the variance,
indicated by the large yellow circle centered at the black dot labeled “closest fit
in population.” A shrunken or regularized fit is also shown, having additional
estimation bias, but smaller prediction error due to its decreased variance.


7.3.1 Example: Bias-Variance Tradeoff 

Figure 7.3 shows the bias-variance tradeoff for two simulated examples.
There are 80 observations and 20 predictors, uniformly distributed in the
hypercube $[0, 1]^{20}$. The situations are as follows:

Left panels: $Y$ is 0 if $X_1 \le 1/2$ and 1 if $X_1 > 1/2$, and we apply $k$-nearest
neighbors.

Right panels: $Y$ is 1 if $\sum_{j=1}^{10} X_j$ is greater than 5 and 0 otherwise, and we
use best subset linear regression of size $p$.

The top row is regression with squared error loss; the bottom row is classification
with 0-1 loss. The figures show the prediction error (red), squared
bias (green) and variance (blue), all computed for a large test sample.

In the regression problems, bias and variance add to produce the prediction
error curves, with minima at about $k = 5$ for $k$-nearest neighbors, and
$p \ge 10$ for the linear model. For classification loss (bottom figures), some
interesting phenomena can be seen. The bias and variance curves are the
same as in the top figures, and prediction error now refers to misclassification
rate. We see that prediction error is no longer the sum of squared
bias and variance. For the $k$-nearest neighbor classifier, prediction error
decreases or stays the same as the number of neighbors is increased to 20,
despite the fact that the squared bias is rising. For the linear model classifier
the minimum occurs for $p \ge 10$ as in regression, but the improvement
over the $p = 1$ model is more dramatic. We see that bias and variance seem
to interact in determining prediction error.

Why does this happen? There is a simple explanation for the first phenomenon.
Suppose at a given input point, the true probability of class 1 is
0.9 while the expected value of our estimate is 0.6. Then the squared bias,
$(0.6 - 0.9)^2$, is considerable, but the prediction error is zero since we make
the correct decision. In other words, estimation errors that leave us on the
right side of the decision boundary don’t hurt. Exercise 7.2 demonstrates
this phenomenon analytically, and also shows the interaction effect between
bias and variance.
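The numbers in this example can be pushed one step further to show how variance enters. The sketch below adds an assumption that is not in the text: the estimate of the class-1 probability is taken to be Gaussian around its mean of 0.6, with two illustrative standard deviations. It then computes the chance that the estimate falls on the wrong side of 1/2.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

p_true, p_hat_mean = 0.9, 0.6
squared_bias = (p_hat_mean - p_true) ** 2      # considerable, yet harmless by itself

def prob_wrong_side(sd):
    """P(estimate < 1/2) when the estimate is N(p_hat_mean, sd^2) (our assumption)."""
    return normal_cdf((0.5 - p_hat_mean) / sd)

low_variance = prob_wrong_side(0.05)    # estimate rarely crosses the boundary
high_variance = prob_wrong_side(0.20)   # estimate crosses it far more often
```

The squared bias is the same in both cases; only the variance changes how often the classification disagrees with the correct decision, which illustrates the interaction between bias and variance under 0-1 loss.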

The overall point is that the bias-variance tradeoff behaves differently
for 0-1 loss than it does for squared error loss. This in turn means that
the best choices of tuning parameters may differ substantially in the two
settings. One should base the choice of tuning parameter on an estimate of
prediction error, as described in the following sections.

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance
(blue) for a simulated example. The top row is regression with squared error
loss; the bottom row is classification with 0-1 loss. The models are k-nearest
neighbors (left) and best subset regression of size p (right). The variance and bias
curves are the same in regression and classification, but the prediction error curve
is different.


7.4 Optimism of the Training Error Rate 

Discussions of error rate estimation can be confusing, because we have
to make clear which quantities are fixed and which are random.$^1$ Before
we continue, we need a few definitions, elaborating on the material of
Section 7.2. Given a training set $\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ the
generalization error of a model $\hat f$ is

$$\mathrm{Err}_{\mathcal{T}} = E_{X^0, Y^0}[L(Y^0, \hat f(X^0)) \mid \mathcal{T}].$$   (7.15)

Note that the training set $\mathcal{T}$ is fixed in expression (7.15). The point $(X^0, Y^0)$
is a new test data point, drawn from $F$, the joint distribution of the data.
Averaging over training sets $\mathcal{T}$ yields the expected error

$$\mathrm{Err} = E_{\mathcal{T}} E_{X^0, Y^0}[L(Y^0, \hat f(X^0)) \mid \mathcal{T}],$$   (7.16)

which is more amenable to statistical analysis. As mentioned earlier, it
turns out that most methods effectively estimate the expected error rather
than $\mathrm{Err}_{\mathcal{T}}$; see Section 7.12 for more on this point.

Now typically, the training error

$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat f(x_i))$$   (7.17)

will be less than the true error $\mathrm{Err}_{\mathcal{T}}$, because the same data is being used
to fit the method and assess its error (see Exercise 2.9). A fitting method
typically adapts to the training data, and hence the apparent or training
error $\overline{\mathrm{err}}$ will be an overly optimistic estimate of the generalization error
$\mathrm{Err}_{\mathcal{T}}$.

Part of the discrepancy is due to where the evaluation points occur. The
quantity $\mathrm{Err}_{\mathcal{T}}$ can be thought of as extra-sample error, since the test input
vectors don’t need to coincide with the training input vectors. The nature
of the optimism in $\overline{\mathrm{err}}$ is easiest to understand when we focus instead on
the in-sample error

$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^0}[L(Y_i^0, \hat f(x_i)) \mid \mathcal{T}].$$   (7.18)

The $Y^0$ notation indicates that we observe $N$ new response values at
each of the training points $x_i$, $i = 1, 2, \ldots, N$. We define the optimism as
the difference between $\mathrm{Err}_{\mathrm{in}}$ and the training error $\overline{\mathrm{err}}$:

$$\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}.$$   (7.19)

This is typically positive since $\overline{\mathrm{err}}$ is usually biased downward as an estimate
of prediction error. Finally, the average optimism is the expectation of the
optimism over training sets

$$\omega \equiv E_{\mathbf{y}}(\mathrm{op}).$$   (7.20)

Here the predictors in the training set are fixed, and the expectation is
over the training set outcome values; hence we have used the notation $E_{\mathbf{y}}$
instead of $E_{\mathcal{T}}$. We can usually estimate only the expected error $\omega$ rather
than op, in the same way that we can estimate the expected error Err
rather than the conditional error $\mathrm{Err}_{\mathcal{T}}$.

$^1$Indeed, in the first edition of our book, this section wasn’t sufficiently clear.

For squared error, 0-1, and other loss functions, one can show quite
generally that

$$\omega = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i),$$   (7.21)

where Cov indicates covariance. Thus the amount by which $\overline{\mathrm{err}}$ underestimates
the true error depends on how strongly $y_i$ affects its own prediction.
The harder we fit the data, the greater $\mathrm{Cov}(\hat y_i, y_i)$ will be, thereby increasing
the optimism. Exercise 7.4 proves this result for squared error loss where
$\hat y_i$ is the fitted value from the regression. For 0-1 loss, $\hat y_i \in \{0, 1\}$ is the
classification at $x_i$, and for entropy loss, $\hat y_i \in [0, 1]$ is the fitted probability
of class 1 at $x_i$.

In summary, we have the important relation

$$E_{\mathbf{y}}(\mathrm{Err}_{\mathrm{in}}) = E_{\mathbf{y}}(\overline{\mathrm{err}}) + \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i).$$   (7.22)

This expression simplifies if $\hat y_i$ is obtained by a linear fit with $d$ inputs
or basis functions. For example,

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$$   (7.23)

for the additive error model $Y = f(X) + \varepsilon$, and so

$$E_{\mathbf{y}}(\mathrm{Err}_{\mathrm{in}}) = E_{\mathbf{y}}(\overline{\mathrm{err}}) + 2 \cdot \frac{d}{N} \sigma_\varepsilon^2.$$   (7.24)

Expression (7.23) is the basis for the definition of the effective number of
parameters discussed in Section 7.6. The optimism increases linearly with
the number $d$ of inputs or basis functions we use, but decreases as the
training sample size increases. Versions of (7.24) hold approximately for
other error models, such as binary data and entropy loss.
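Relation (7.24) can be verified by simulation for a least squares fit. The sketch assumes a made-up fixed design, a true mean that is linear in the inputs, and Gaussian noise; under squared error loss the in-sample error at each training point is $\sigma_\varepsilon^2 + (f(x_i) - \hat y_i)^2$, which lets us compute the optimism exactly within each replication.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 40, 5, 1.0
X = rng.normal(size=(N, d))               # fixed training inputs (made up)
f = X @ rng.normal(size=d)                # true mean at the training points
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix of the least squares fit

n_rep = 20_000
Y = f + sigma * rng.normal(size=(n_rep, N))   # one training response vector per row
Y_hat = Y @ H                                 # fitted values (H is symmetric)

err_bar = np.mean((Y - Y_hat) ** 2, axis=1)              # training error
err_in = sigma ** 2 + np.mean((f - Y_hat) ** 2, axis=1)  # in-sample error (7.18)

omega_mc = np.mean(err_in - err_bar)      # Monte Carlo average optimism
omega_theory = 2 * d * sigma ** 2 / N     # value implied by (7.21) and (7.23)
```

Doubling `d` doubles `omega_theory`, and doubling `N` halves it, in line with the remark that optimism grows with the number of basis functions and shrinks with the training sample size.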

An obvious way to estimate prediction error is to estimate the optimism
and then add it to the training error $\overline{\mathrm{err}}$. The methods described in the
next section ($C_p$, AIC, BIC and others) work in this way, for a special
class of estimates that are linear in their parameters.

In contrast, cross-validation and bootstrap methods, described later in
the chapter, are direct estimates of the extra-sample error Err. These general
tools can be used with any loss function, and with nonlinear, adaptive
fitting techniques.

In-sample error is not usually of direct interest since future values of the 
features are not likely to coincide with their training set values. But for 
comparison between models, in-sample error is convenient and often leads 
to effective model selection. The reason is that the relative (rather than 
absolute) size of the error is what matters. 

7.5 Estimates of In-Sample Prediction Error 

The general form of the in-sample estimates is

$$\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \hat\omega,$$   (7.25)

where $\hat\omega$ is an estimate of the average optimism.

Using expression (7.24), applicable when $d$ parameters are fit under
squared error loss, leads to a version of the so-called $C_p$ statistic,

$$C_p = \overline{\mathrm{err}} + 2 \cdot \frac{d}{N} \hat\sigma_\varepsilon^2.$$   (7.26)

Here $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance, obtained from the mean-squared
error of a low-bias model. Using this criterion we adjust the training
error by a factor proportional to the number of basis functions used.

The Akaike information criterion is a similar but more generally applicable
estimate of $\mathrm{Err}_{\mathrm{in}}$ when a log-likelihood loss function is used. It relies
on a relationship similar to (7.24) that holds asymptotically as $N \to \infty$:

$$-2 \cdot E[\log \Pr{}_{\hat\theta}(Y)] \approx -\frac{2}{N} \cdot E[\mathrm{loglik}] + 2 \cdot \frac{d}{N}.$$   (7.27)

Here $\Pr_\theta(Y)$ is a family of densities for $Y$ (containing the “true” density),
$\hat\theta$ is the maximum-likelihood estimate of $\theta$, and “loglik” is the maximized
log-likelihood:

$$\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr{}_{\hat\theta}(y_i).$$   (7.28)

For example, for the logistic regression model, using the binomial
log-likelihood, we have

$$\mathrm{AIC} = -\frac{2}{N} \cdot \mathrm{loglik} + 2 \cdot \frac{d}{N}.$$   (7.29)

For the Gaussian model (with variance $\sigma_\varepsilon^2 = \hat\sigma_\varepsilon^2$ assumed known), the AIC
statistic is equivalent to $C_p$, and so we refer to them collectively as AIC.

To use AIC for model selection, we simply choose the model giving smallest
AIC over the set of models considered. For nonlinear and other complex
models, we need to replace d by some measure of model complexity. We 
discuss this in Section 7.6. 

Given a set of models f a (x) indexed by a tuning parameter a, denote 
by efr(a) and d[a) the training error and number of parameters for each 
model. Then for this set of models we define 

AIC(a) = err(a) + 2 • ^-a 2 . (7.30) 

The function AIC(a) provides an estimate of the test error curve, and we 
find the tuning parameter a that minimizes it. Our final chosen model 
is f&(x). Note that if the basis functions are chosen adaptively, (7.23) no 
longer holds. For example, if we have a total of p inputs, and we choose 
the best-fitting linear model with d < p inputs, the optimism will exceed 
(2d/N)crf. Put another way, by choosing the best-fitting model with d 
inputs, the effective number of parameters fit is more than d. 
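
To make the AIC curve of (7.30) concrete, here is a minimal Python sketch
for a fixed, nested family of polynomial fits, so the basis is not chosen
adaptively. The function name aic_curve, the simulated data and the choice
of RSS/(N - d) from the largest model as the noise-variance estimate are
illustrative assumptions, not part of the text.

```python
import numpy as np

def aic_curve(x, y, max_degree):
    """Return AIC(alpha) of (7.30) for polynomial fits of degree 0..max_degree."""
    n = len(y)
    fits = []
    for degree in range(max_degree + 1):
        # Design matrix with columns 1, x, ..., x^degree, so d = degree + 1.
        basis = np.vander(x, degree + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
        resid = y - basis @ coef
        fits.append((degree + 1, np.mean(resid ** 2)))
    # Assumed noise-variance estimate: RSS / (N - d) of the largest (low-bias) model.
    d_big, err_big = fits[-1]
    sigma2_hat = err_big * n / (n - d_big)
    return np.array([err + 2.0 * d / n * sigma2_hat for d, err in fits])

# Simulated data from a quadratic with Gaussian noise (hypothetical example).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.5, size=200)

aic = aic_curve(x, y, max_degree=8)
best_degree = int(np.argmin(aic))  # degree minimizing the AIC curve
```

The training error alone would always favor the largest degree; the penalty
term is what lets the curve turn upward once extra basis functions stop
paying for themselves.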

Figure 7.4 shows AIC in action for the phoneme recognition example
of Section 5.2.3 on page 148. The input vector is the log-periodogram of
the spoken vowel, quantized to 256 uniformly spaced frequencies. A linear
logistic regression model is used to predict the phoneme class, with
coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$, an expansion in $M$ spline
basis functions. For any given $M$, a basis of natural cubic splines is used
for the $h_m$, with knots chosen uniformly over the range of frequencies (so
$d(\alpha) = d(M) = M$). Using AIC to select the number of basis functions will
approximately minimize $\mathrm{Err}(M)$ for both entropy and 0-1 loss.

The simple formula

$$(2/N) \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = (2d/N)\,\sigma_\varepsilon^2$$

holds exactly for linear models with additive errors and squared error loss,
and approximately for linear models and log-likelihoods. In particular, the
formula does not hold in general for 0-1 loss (Efron, 1986), although many
authors nevertheless use it in that context (right panel of Figure 7.4).

FIGURE 7.4. AIC used for model selection for the phoneme recognition
example of Section 5.2.3. The logistic regression coefficient function
$\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$ is modeled as an expansion in $M$ spline basis functions.
In the left panel we see the AIC statistic used to estimate $\mathrm{Err}_{\mathrm{in}}$ using
log-likelihood loss. Included is an estimate of Err based on an independent
test sample. It does well except for the extremely over-parametrized case
($M = 256$ parameters for $N = 1000$ observations). In the right panel the same
is done for 0-1 loss. Although the AIC formula does not strictly apply here,
it does a reasonable job in this case.


7.6 The Effective Number of Parameters 

The concept of “number of parameters” can be generalized, especially to
models where regularization is used in the fitting. Suppose we stack the
outcomes $y_1, y_2, \ldots, y_N$ into a vector $\mathbf{y}$, and similarly for the predictions
$\hat{\mathbf{y}}$. Then a linear fitting method is one for which we can write

$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}, \tag{7.31}$$

where $\mathbf{S}$ is an $N \times N$ matrix depending on the input vectors $x_i$ but not on
the $y_i$. Linear fitting methods include linear regression on the original
features or on a derived basis set, and smoothing methods that use quadratic
shrinkage, such as ridge regression and cubic smoothing splines. Then the
effective number of parameters is defined as

$$\mathrm{df}(\mathbf{S}) = \mathrm{trace}(\mathbf{S}), \tag{7.32}$$

the sum of the diagonal elements of $\mathbf{S}$ (also known as the effective
degrees-of-freedom). Note that if $\mathbf{S}$ is an orthogonal-projection matrix onto a
basis set spanned by $M$ features, then $\mathrm{trace}(\mathbf{S}) = M$. It turns out that
$\mathrm{trace}(\mathbf{S})$ is exactly the correct quantity to replace $d$ as the number of
parameters in the $C_p$ statistic (7.26). If $\mathbf{y}$ arises from an additive-error
model $Y = f(X) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$, then one can show that
$\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = \mathrm{trace}(\mathbf{S})\,\sigma_\varepsilon^2$, which motivates the more general definition

$$\mathrm{df}(\hat{\mathbf{y}}) = \frac{\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)}{\sigma_\varepsilon^2} \tag{7.33}$$

(Exercises 7.4 and 7.5). Section 5.4.1 on page 153 gives some more intuition
for the definition df = trace(S) in the context of smoothing splines.
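
As a small numerical illustration of df = trace(S), the sketch below builds
the smoother matrix for ridge regression, one of the linear fitting methods
named above. The helper name smoother_matrix, the simulated inputs and the
penalty value 10 are assumptions made for the example; the cross-check uses
the standard singular-value expression for the ridge degrees of freedom.

```python
import numpy as np

def smoother_matrix(X, lam):
    """S = X (X^T X + lam I)^{-1} X^T, so that y_hat = S y for ridge regression."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))  # hypothetical inputs: N = 50, five features

# With lam = 0, S is the orthogonal projection onto the 5 features.
df_ols = np.trace(smoother_matrix(X, 0.0))
# Quadratic shrinkage lowers the effective number of parameters.
df_ridge = np.trace(smoother_matrix(X, 10.0))

# Cross-check with the singular values d_j of X: sum_j d_j^2 / (d_j^2 + lam).
sv = np.linalg.svd(X, compute_uv=False)
df_ridge_svd = np.sum(sv ** 2 / (sv ** 2 + 10.0))
```

With no penalty the trace recovers the number of features, and any positive
penalty gives a non-integer value strictly between 0 and 5.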

For models like neural networks, in which we minimize an error function
$R(w)$ with weight decay penalty (regularization) $\alpha \sum_m w_m^2$, the effective
number of parameters has the form

$$\mathrm{df}(\alpha) = \sum_{m=1}^{M} \frac{\theta_m}{\theta_m + \alpha}, \tag{7.34}$$

where the $\theta_m$ are the eigenvalues of the Hessian matrix $\partial^2 R(w)/\partial w\,\partial w^T$.
Expression (7.34) follows from (7.32) if we make a quadratic approximation
to the error function at the solution (Bishop, 1995).


7.7 The Bayesian Approach and BIC 


The Bayesian information criterion (BIC), like AIC, is applicable in settings 
where the fitting is carried out by maximization of a log-likelihood. The 
generic form of BIC is 

$$\mathrm{BIC} = -2 \cdot \mathrm{loglik} + (\log N) \cdot d. \tag{7.35}$$


The BIC statistic (times 1/2) is also known as the Schwarz criterion (Schwarz, 
1978). 

Under the Gaussian model, assuming the variance $\sigma_\varepsilon^2$ is known, $-2 \cdot \mathrm{loglik}$
equals (up to a constant) $\sum_i (y_i - \hat{f}(x_i))^2/\sigma_\varepsilon^2$, which is $N \cdot \overline{\mathrm{err}}/\sigma_\varepsilon^2$ for
squared error loss. Hence we can write

$$\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\left[\overline{\mathrm{err}} + (\log N) \cdot \frac{d}{N}\,\sigma_\varepsilon^2\right]. \tag{7.36}$$

Therefore BIC is proportional to AIC ($C_p$), with the factor 2 replaced
by $\log N$. Assuming $N > e^2 \approx 7.4$, BIC tends to penalize complex models
more heavily, giving preference to simpler models in selection. As with AIC,
$\sigma_\varepsilon^2$ is typically estimated by the mean squared error of a low-bias model.
For classification problems, use of the multinomial log-likelihood leads to a
similar relationship with the AIC, using cross-entropy as the error measure.
Note however that the misclassification error measure does not arise in the
BIC context, since it does not correspond to the log-likelihood of the data
under any probability model.

Despite its similarity with AIC, BIC is motivated in quite a different 
way. It arises in the Bayesian approach to model selection, which we now 
describe. 

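Suppose we have a set of candidate models $\mathcal{M}_m$, $m = 1, \ldots, M$ and
corresponding model parameters $\theta_m$, and we wish to choose a best model
from among them. Assuming we have a prior distribution $\Pr(\theta_m|\mathcal{M}_m)$ for
the parameters of each model $\mathcal{M}_m$, the posterior probability of a given
model is

$$\begin{aligned}
\Pr(\mathcal{M}_m|\mathbf{Z}) &\propto \Pr(\mathcal{M}_m) \cdot \Pr(\mathbf{Z}|\mathcal{M}_m) \\
&\propto \Pr(\mathcal{M}_m) \cdot \int \Pr(\mathbf{Z}|\theta_m, \mathcal{M}_m)\,\Pr(\theta_m|\mathcal{M}_m)\,d\theta_m,
\end{aligned} \tag{7.37}$$

where $\mathbf{Z}$ represents the training data $\{x_i, y_i\}_1^N$. To compare two models
$\mathcal{M}_m$ and $\mathcal{M}_\ell$, we form the posterior odds

$$\frac{\Pr(\mathcal{M}_m|\mathbf{Z})}{\Pr(\mathcal{M}_\ell|\mathbf{Z})} = \frac{\Pr(\mathcal{M}_m)}{\Pr(\mathcal{M}_\ell)} \cdot \frac{\Pr(\mathbf{Z}|\mathcal{M}_m)}{\Pr(\mathbf{Z}|\mathcal{M}_\ell)}. \tag{7.38}$$

If the odds are greater than one we choose model $m$, otherwise we choose
model $\ell$. The rightmost quantity

$$\mathrm{BF}(\mathbf{Z}) = \frac{\Pr(\mathbf{Z}|\mathcal{M}_m)}{\Pr(\mathbf{Z}|\mathcal{M}_\ell)} \tag{7.39}$$

is called the Bayes factor, the contribution of the data toward the posterior
odds.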

Typically we assume that the prior over models is uniform, so that
$\Pr(\mathcal{M}_m)$ is constant. We need some way of approximating $\Pr(\mathbf{Z}|\mathcal{M}_m)$.
A so-called Laplace approximation to the integral followed by some other
simplifications (Ripley, 1996, page 64) to (7.37) gives

$$\log \Pr(\mathbf{Z}|\mathcal{M}_m) = \log \Pr(\mathbf{Z}|\hat{\theta}_m, \mathcal{M}_m) - \frac{d_m}{2} \cdot \log N + O(1). \tag{7.40}$$

Here $\hat{\theta}_m$ is a maximum likelihood estimate and $d_m$ is the number of free
parameters in model $\mathcal{M}_m$. If we define our loss function to be

$$-2 \log \Pr(\mathbf{Z}|\hat{\theta}_m, \mathcal{M}_m),$$

this is equivalent to the BIC criterion of equation (7.35).

Therefore, choosing the model with minimum BIC is equivalent to choosing
the model with largest (approximate) posterior probability. But this
framework gives us more. If we compute the BIC criterion for a set of $M$
models, giving $\mathrm{BIC}_m$, $m = 1, 2, \ldots, M$, then we can estimate the posterior
probability of each model $\mathcal{M}_m$ as

$$\frac{e^{-\frac{1}{2} \cdot \mathrm{BIC}_m}}{\sum_{\ell=1}^{M} e^{-\frac{1}{2} \cdot \mathrm{BIC}_\ell}}. \tag{7.41}$$

Thus we can estimate not only the best model, but also assess the relative
merits of the models considered.
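
Under a uniform prior over models, (7.40) implies that the posterior
probability of model m is approximately proportional to exp(-BIC_m / 2),
normalized over the candidate set. A minimal Python sketch of that
normalization follows; the function name bic_posterior and the three BIC
values are hypothetical, chosen only to show the calculation.

```python
import math

def bic_posterior(bic_values):
    """Approximate posterior model probabilities from a list of BIC values."""
    # Subtract the minimum first so the exponentials cannot all underflow to zero.
    best = min(bic_values)
    weights = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC values for three candidate models.
probs = bic_posterior([1012.4, 1010.1, 1017.9])
```

Only differences in BIC matter here, which is why shifting all values by
the minimum leaves the probabilities unchanged.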

For model selection purposes, there is no clear choice between AIC and
BIC. BIC is asymptotically consistent as a selection criterion. What this
means is that given a family of models, including the true model, the
probability that BIC will select the correct model approaches one as the sample
size $N \to \infty$. This is not the case for AIC, which tends to choose models
which are too complex as $N \to \infty$. On the other hand, for finite samples,
BIC often chooses models that are too simple, because of its heavy penalty
on complexity.

7.8 Minimum Description Length 

The minimum description length (MDL) approach gives a selection criterion
formally identical to the BIC approach, but is motivated from an
optimal coding viewpoint. We first review the theory of coding for data
compression, and then apply it to model selection.

We think of our datum z as a message that we want to encode and 
send to someone else (the “receiver”). We think of our model as a way of 
encoding the datum, and will choose the most parsimonious model, that is 
the shortest code, for the transmission. 

Suppose first that the possible messages we might want to transmit are
$z_1, z_2, \ldots, z_m$. Our code uses a finite alphabet of length $A$: for example, we
might use a binary code $\{0, 1\}$ of length $A = 2$. Here is an example with
four possible messages and a binary coding:

    Message   $z_1$   $z_2$   $z_3$   $z_4$
    Code      0       10      110     111          (7.42)


This code is known as an instantaneous prefix code: no code is the prefix
of any other, and the receiver (who knows all of the possible codes)
knows exactly when the message has been completely sent. We restrict our
discussion to such instantaneous prefix codes.

One could use the coding in (7.42) or we could permute the codes, for
example use codes 110, 10, 111, 0 for $z_1, z_2, z_3, z_4$. How do we decide which
to use? It depends on how often we will be sending each of the messages.
If, for example, we will be sending $z_1$ most often, it makes sense to use the
shortest code 0 for $z_1$. Using this kind of strategy (shorter codes for more
frequent messages) the average message length will be shorter.











In general, if messages are sent with probabilities $\Pr(z_i)$, $i = 1, 2, \ldots, 4$,
a famous theorem due to Shannon says we should use code lengths
$l_i = -\log_2 \Pr(z_i)$ and the average message length satisfies

$$E(\text{length}) \ge -\sum \Pr(z_i) \log_2(\Pr(z_i)). \tag{7.43}$$

The right-hand side above is also called the entropy of the distribution
$\Pr(z_i)$. The inequality is an equality when the probabilities satisfy
$p_i = A^{-l_i}$. In our example, if $\Pr(z_i) = 1/2, 1/4, 1/8, 1/8$, respectively, then the
coding shown in (7.42) is optimal and achieves the entropy lower bound.
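
The claim for this example can be checked directly. The short Python
sketch below computes the average message length of the coding in (7.42)
under the probabilities 1/2, 1/4, 1/8, 1/8, compares it with the entropy,
and repeats the calculation for the permuted assignment mentioned earlier.
The dictionary layout and the helper name is_prefix_free are illustrative
choices, not notation from the text.

```python
import math

codes = {"z1": "0", "z2": "10", "z3": "110", "z4": "111"}  # coding (7.42)
probs = {"z1": 1 / 2, "z2": 1 / 4, "z3": 1 / 8, "z4": 1 / 8}

def is_prefix_free(words):
    """True if no codeword is a prefix of a different codeword."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Average number of bits per message under coding (7.42).
expected_length = sum(probs[z] * len(codes[z]) for z in codes)
# Entropy of the message distribution, in bits.
entropy = -sum(p * math.log2(p) for p in probs.values())

# Permuted assignment: codes 110, 10, 111, 0 for z1, z2, z3, z4.
permuted = {"z1": "110", "z2": "10", "z3": "111", "z4": "0"}
permuted_length = sum(probs[z] * len(permuted[z]) for z in permuted)
```

Both the average length and the entropy come out to 1.75 bits, while the
permuted assignment, which gives the most frequent message a three-bit
code, averages 2.5 bits.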

In general the lower bound cannot be achieved, but procedures like the
Huffman coding scheme can get close to the bound. Note that with an
infinite set of messages, the entropy is replaced by $-\int \Pr(z) \log_2 \Pr(z)\,dz$.

From this result we glean the following: 

To transmit a random variable $z$ having probability density function
$\Pr(z)$, we require about $-\log_2 \Pr(z)$ bits of information.

We henceforth change notation from $\log_2 \Pr(z)$ to $\log \Pr(z) = \log_e \Pr(z)$;
this is for convenience, and just introduces an unimportant multiplicative
constant.

Now we apply this result to the problem of model selection. We have
a model $M$ with parameters $\theta$, and data $\mathbf{Z} = (\mathbf{X}, \mathbf{y})$ consisting of both
inputs and outputs. Let the (conditional) probability of the outputs under
the model be $\Pr(\mathbf{y}|\theta, M, \mathbf{X})$, assume the receiver knows all of the inputs,
and we wish to transmit the outputs. Then the message length required to
transmit the outputs is

$$\text{length} = -\log \Pr(\mathbf{y}|\theta, M, \mathbf{X}) - \log \Pr(\theta|M), \tag{7.44}$$

the log-probability of the target values given the inputs. The second term
is the average code length for transmitting the model parameters $\theta$, while
the first term is the average code length for transmitting the discrepancy
between the model and actual target values. For example suppose we have
a single target $y$ with $y \sim N(\theta, \sigma^2)$, parameter $\theta \sim N(0, 1)$ and no input
(for simplicity). Then the message length is

$$\text{length} = \text{constant} + \log \sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}. \tag{7.45}$$

Note that the smaller $\sigma$ is, the shorter on average is the message length,
since $y$ is more concentrated around $\theta$.
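
For readers who want the intermediate step, the message length in this
example can be written out term by term. The sketch below assumes the
setup just described, a single target with $y \sim N(\theta, \sigma^2)$ and prior
$\theta \sim N(0, 1)$, and simply takes negative logs of the two Gaussian
densities; under that assumption the constant works out to $\log(2\pi)$.

```latex
\begin{align*}
-\log \Pr(y \mid \theta) &= \tfrac{1}{2}\log(2\pi) + \log\sigma + \frac{(y-\theta)^2}{2\sigma^2}, \\
-\log \Pr(\theta) &= \tfrac{1}{2}\log(2\pi) + \frac{\theta^2}{2}, \\
\text{length} &= \log(2\pi) + \log\sigma + \frac{(y-\theta)^2}{2\sigma^2} + \frac{\theta^2}{2}.
\end{align*}
```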

The MDL principle says that we should choose the model that minimizes
(7.44). We recognize (7.44) as the (negative) log-posterior distribution,
and hence minimizing description length is equivalent to maximizing
posterior probability. Hence the BIC criterion, derived as an approximation to
log-posterior probability, can also be viewed as a device for (approximate)
model choice by minimum description length.

FIGURE 7.5. The solid curve is the function $\sin(50x)$ for $x \in [0, 1]$. The green
(solid) and blue (hollow) points illustrate how the associated indicator function
$I(\sin(\alpha x) > 0)$ can shatter (separate) an arbitrarily large number of points by
choosing an appropriately high frequency $\alpha$.

Note that we have ignored the precision with which a random variable
$z$ is coded. With a finite code length we cannot code a continuous variable
exactly. However, if we code $z$ within a tolerance $\delta z$, the message length
needed is the log of the probability in the interval $[z, z + \delta z]$, which is well
approximated by $\delta z \Pr(z)$ if $\delta z$ is small. Since $\log \delta z \Pr(z) = \log \delta z + \log \Pr(z)$,
this means we can just ignore the constant $\log \delta z$ and use $\log \Pr(z)$ as our
measure of message length, as we did above.

The preceding view of MDL for model selection says that we should
choose the model with highest posterior probability. However, many
Bayesians would instead do inference by sampling from the posterior distribution.


7.9 Vapnik-Chervonenkis Dimension 

A difficulty in using estimates of in-sample error is the need to specify the 
number of parameters (or the complexity) d used in the fit. Although the 
effective number of parameters introduced in Section 7.6 is useful for some 
nonlinear models, it is not fully general. The Vapnik-Chervonenkis (VC) 
theory provides such a general measure of complexity, and gives associated 
bounds on the optimism. Here we give a brief review of this theory. 

Suppose we have a class of functions $\{f(x, \alpha)\}$ indexed by a parameter
vector $\alpha$, with $x \in \mathbb{R}^p$. Assume for now that $f$ is an indicator function,
that is, takes the values 0 or 1. If $\alpha = (\alpha_0, \alpha_1)$ and $f$ is the linear
indicator function $I(\alpha_0 + \alpha_1^T x > 0)$, then it seems reasonable to say that the
complexity of the class $f$ is the number of parameters $p + 1$. But what
about $f(x, \alpha) = I(\sin \alpha \cdot x)$ where $\alpha$ is any real number and $x \in \mathbb{R}$? The
function $\sin(50 \cdot x)$ is shown in Figure 7.5. This is a very wiggly function
that gets even rougher as the frequency $\alpha$ increases, but it has only one
parameter: despite this, it doesn’t seem reasonable to conclude that it has
less complexity than the linear indicator function $I(\alpha_0 + \alpha_1 x)$ in $p = 1$
dimension.

FIGURE 7.6. The first three panels show that the class of lines in the plane
can shatter three points. The last panel shows that this class cannot shatter four
points, as no line will put the hollow points on one side and the solid points on
the other. Hence the VC dimension of the class of straight lines in the plane is
three. Note that a class of nonlinear curves could shatter four points, and hence
has VC dimension greater than three.


The Vapnik-Chervonenkis dimension is a way of measuring the complexity
of a class of functions by assessing how wiggly its members can be.

The VC dimension of the class $\{f(x, \alpha)\}$ is defined to be the
largest number of points (in some configuration) that can be
shattered by members of $\{f(x, \alpha)\}$.


A set of points is said to be shattered by a class of functions if, no matter 
how we assign a binary label to each point, a member of the class can 
perfectly separate them. 

Figure 7.6 shows that the VC dimension of linear indicator functions
in the plane is 3 but not 4, since no four points can be shattered by a
set of lines. In general, a linear indicator function in $p$ dimensions has VC
dimension $p + 1$, which is also the number of free parameters. On the other
hand, it can be shown that the family $\sin(\alpha x)$ has infinite VC dimension,
as Figure 7.5 suggests. By appropriate choice of $\alpha$, any set of points can be
shattered by this class (Exercise 7.8).
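
The shattering property of the sine family can be checked numerically on a
small configuration. The Python sketch below uses one well-known
construction that is not taken from the text: the points are placed at
10^(-j) for j = 1, ..., 5, and for a given labeling the frequency is set
to pi * (1 + sum of (1 - y_j) * 10^j). The function names and the choice of
five points are assumptions made for the illustration.

```python
import itertools
import math

def shattering_frequency(labels):
    """Frequency alpha such that I(sin(alpha * x_j) > 0) equals labels[j-1]
    at the points x_j = 10**(-j), j = 1, ..., len(labels)."""
    return math.pi * (1 + sum((1 - y) * 10 ** j
                              for j, y in enumerate(labels, start=1)))

def classify(alpha, x):
    """Indicator function I(sin(alpha * x) > 0)."""
    return int(math.sin(alpha * x) > 0)

n_points = 5
points = [10.0 ** (-j) for j in range(1, n_points + 1)]

# Try all 2^5 labelings; each must be reproduced by a single frequency alpha.
shattered = all(
    [classify(shattering_frequency(labels), x) for x in points] == list(labels)
    for labels in itertools.product([0, 1], repeat=n_points)
)
```

The same recipe works for any number of points in exact arithmetic, which is
the sense in which a one-parameter family can have unbounded VC dimension;
in floating point the check is only reliable for a modest number of points.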

So far we have discussed the VC dimension only of indicator functions,
but this can be extended to real-valued functions. The VC dimension of a
class of real-valued functions $\{g(x, \alpha)\}$ is defined to be the VC dimension
of the indicator class $\{I(g(x, \alpha) - \beta > 0)\}$, where $\beta$ takes values over the
range of $g$.

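One can use the VC dimension in constructing an estimate of (extra-sample)
prediction error; different types of results are available. Using the
concept of VC dimension, one can prove results about the optimism of the
training error when using a class of functions. An example of such a result is
the following. If we fit $N$ training points using a class of functions $\{f(x, \alpha)\}$
having VC dimension $h$, then with probability at least $1 - \eta$ over training
sets:

$$\begin{aligned}
\mathrm{Err}_{\mathcal{T}} &\le \overline{\mathrm{err}} + \frac{\epsilon}{2}\left(1 + \sqrt{1 + \frac{4 \cdot \overline{\mathrm{err}}}{\epsilon}}\right) && \text{(binary classification)} \\
\mathrm{Err}_{\mathcal{T}} &\le \frac{\overline{\mathrm{err}}}{(1 - c\sqrt{\epsilon})_+} && \text{(regression)}
\end{aligned} \tag{7.46}$$

where

$$\epsilon = a_1 \frac{h[\log(a_2 N/h) + 1] - \log(\eta/4)}{N},$$

and $0 < a_1 \le 4$, $0 < a_2 \le 2$.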


These bounds hold simultaneously for all members $f(x, \alpha)$, and are taken
from Cherkassky and Mulier (2007, pages 116-118). They recommend the
value $c = 1$. For regression they suggest $a_1 = a_2 = 1$, and for classification
they make no recommendation, with $a_1 = 4$ and $a_2 = 2$ corresponding
to worst-case scenarios. They also give an alternative practical bound for
regression

$$\mathrm{Err}_{\mathcal{T}} \le \overline{\mathrm{err}}\left(1 - \sqrt{\rho - \rho \log \rho + \frac{\log N}{2N}}\right)_+^{-1} \tag{7.47}$$

with $\rho = h/N$, which is free of tuning constants. The bounds suggest that the
optimism increases with $h$ and decreases with $N$ in qualitative agreement
with the AIC correction $d/N$ given in (7.24). However, the results in (7.46)
are stronger: rather than giving the expected optimism for each fixed
function $f(x, \alpha)$, they give probabilistic upper bounds for all functions $f(x, \alpha)$,
and hence allow for searching over the class.

Vapnik’s structural risk minimization (SRM) approach fits a nested sequence
of models of increasing VC dimensions $h_1 < h_2 < \cdots$, and then
chooses the model with the smallest value of the upper bound.

We note that upper bounds like the ones in (7.46) are often very loose, 
but that doesn’t rule them out as good criteria for model selection, where 
the relative (not absolute) size of the test error is important. The main 
drawback of this approach is the difficulty in calculating the VC dimension 
of a class of functions. Often only a crude upper bound for VC dimension 
is obtainable, and this may not be adequate. An example in which the 
structural risk minimization program can be successfully carried out is the 
support vector classifier, discussed in Section 12.2. 

7.9.1 Example (Continued) 

Figure 7.7 shows the results when AIC, BIC and SRM are used to select
the model size for the examples of Figure 7.3. For the examples labeled KNN,
the model index $\alpha$ refers to neighborhood size, while for those labeled REG $\alpha$
refers to subset size. Using each selection method (e.g., AIC) we estimated
the best model $\hat{\alpha}$ and found its true prediction error $\mathrm{Err}_{\mathcal{T}}(\hat{\alpha})$ on a test
set. For the same training set we computed the prediction error of the best
and worst possible model choices: $\min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)$ and $\max_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)$. The
boxplots show the distribution of the quantity

$$100 \times \frac{\mathrm{Err}_{\mathcal{T}}(\hat{\alpha}) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)}{\max_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)},$$

which represents the error in using the chosen model relative to the best
model. For linear regression the model complexity was measured by the
number of features; as mentioned in Section 7.5, this underestimates the
df, since it does not charge for the search for the best model of that size.
This was also used for the VC dimension of the linear classifier. For
$k$-nearest neighbors, we used the quantity $N/k$. Under an additive-error
regression model, this can be justified as the exact effective degrees of
freedom (Exercise 7.6); we do not know if it corresponds to the VC dimension.
We used $a_1 = a_2 = 1$ for the constants in (7.46); the results for SRM
changed with different constants, and this choice gave the most favorable
results. We repeated the SRM selection using the alternative practical bound
(7.47), and got almost identical results. For misclassification error we used
$\hat{\sigma}_\varepsilon^2 = [N/(N - d)] \cdot \overline{\mathrm{err}}(\alpha)$ for the least restrictive model ($k = 5$ for KNN,
since $k = 1$ results in zero training error). The AIC criterion seems to work
well in all four scenarios, despite the lack of theoretical support with 0-1
loss. BIC does nearly as well, while the performance of SRM is mixed.

FIGURE 7.7. [Three rows of boxplots, labeled AIC, BIC and SRM; within each
row the four scenarios are reg/KNN, reg/linear, class/KNN and class/linear.]
Boxplots show the distribution of the relative error
$100 \times [\mathrm{Err}_{\mathcal{T}}(\hat{\alpha}) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)]/[\max_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha) - \min_\alpha \mathrm{Err}_{\mathcal{T}}(\alpha)]$ over the four
scenarios of Figure 7.3. This is the error in using the chosen model relative to
the best model. There are 100 training sets each of size 80 represented in each
boxplot, with the errors computed on test sets of size 10,000.


7.10 Cross-Validation 

Probably the simplest and most widely used method for estimating prediction
error is cross-validation. This method directly estimates the expected
extra-sample error $\mathrm{Err} = \mathrm{E}[L(Y, \hat{f}(X))]$, the average generalization error
when the method $\hat{f}(X)$ is applied to an independent test sample from the
joint distribution of $X$ and $Y$. As mentioned earlier, we might hope that
cross-validation estimates the conditional error, with the training set $\mathcal{T}$
held fixed. But as we will see in Section 7.12, cross-validation typically
estimates well only the expected prediction error.


7.10.1 K-Fold Cross-Validation 

Ideally, if we had enough data, we would set aside a validation set and use
it to assess the performance of our prediction model. Since data are often
scarce, this is usually not possible. To finesse the problem, $K$-fold
cross-validation uses part of the available data to fit the model, and a different
part to test it. We split the data into $K$ roughly equal-sized parts; for
example, when $K = 5$, the scenario looks like this:

    | Train | Train | Validation | Train | Train |

For the $k$th part (third above), we fit the model to the other $K - 1$ parts
of the data, and calculate the prediction error of the fitted model when
predicting the $k$th part of the data. We do this for $k = 1, 2, \ldots, K$ and
combine the $K$ estimates of prediction error.

Here are more details. Let $\kappa : \{1, \ldots, N\} \mapsto \{1, \ldots, K\}$ be an indexing
function that indicates the partition to which observation $i$ is allocated by
the randomization. Denote by $\hat{f}^{-k}(x)$ the fitted function, computed with
the $k$th part of the data removed. Then the cross-validation estimate of
prediction error is

$$\mathrm{CV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}^{-\kappa(i)}(x_i)). \tag{7.48}$$

Typical choices of $K$ are 5 or 10 (see below). The case $K = N$ is known
as leave-one-out cross-validation. In this case $\kappa(i) = i$, and for the $i$th
observation the fit is computed using all the data except the $i$th.

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, denote
by $\hat{f}^{-k}(x, \alpha)$ the $\alpha$th model fit with the $k$th part of the data removed. Then
for this set of models we define

$$\mathrm{CV}(\hat{f}, \alpha) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)). \tag{7.49}$$

The function $\mathrm{CV}(\hat{f}, \alpha)$ provides an estimate of the test error curve, and we
find the tuning parameter $\hat{\alpha}$ that minimizes it. Our final chosen model is
$f(x, \hat{\alpha})$, which we then fit to all the data.
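
The procedure above can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions rather than a reference
implementation: the family of models is ridge regression indexed by its
penalty (playing the role of the tuning parameter), the loss is squared
error, and the names fit_ridge and kfold_cv as well as the simulated data
are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv(X, y, lam, K, rng):
    """K-fold cross-validation estimate of squared-error prediction error."""
    n = len(y)
    # kappa[i] is the fold of observation i: a random, balanced assignment.
    kappa = rng.permutation(np.arange(n) % K)
    losses = np.empty(n)
    for k in range(K):
        held_out = kappa == k
        beta = fit_ridge(X[~held_out], y[~held_out], lam)  # fit without fold k
        losses[held_out] = (y[held_out] - X[held_out] @ beta) ** 2
    return losses.mean()

# Simulated data (hypothetical): two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
# Reuse the same fold assignment for every penalty so the curve is comparable.
cv_curve = [kfold_cv(X, y, lam, K=5, rng=np.random.default_rng(1))
            for lam in lambdas]
lam_hat = lambdas[int(np.argmin(cv_curve))]
beta_final = fit_ridge(X, y, lam_hat)  # refit the chosen model on all the data
```

Seeding the fold assignment identically for each penalty is a design choice:
it removes fold-to-fold noise from the comparison across models, at the cost
of tying the whole curve to one particular partition.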

It is interesting to wonder about what quantity $K$-fold cross-validation
estimates. With $K = 5$ or 10, we might guess that it estimates the expected
error Err, since the training sets in each fold are quite different
from the original training set. On the other hand, if $K = N$ we might
guess that cross-validation estimates the conditional error $\mathrm{Err}_{\mathcal{T}}$. It turns
out that cross-validation only estimates effectively the average error Err,
as discussed in Section 7.12.

What value should we choose for $K$? With $K = N$, the cross-validation
estimator is approximately unbiased for the true (expected) prediction error,
but can have high variance because the $N$ “training sets” are so similar
to one another. The computational burden is also considerable, requiring
$N$ applications of the learning method. In certain special problems, this
computation can be done quickly; see Exercises 7.3 and 5.13.

FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a
plot of $1 - \mathrm{Err}$ versus the size of the training set $N$. With a dataset of 200
observations, 5-fold cross-validation would use training sets of size 160, which
would behave much like the full set. However, with a dataset of 50 observations
fivefold cross-validation would use training sets of size 40, and this would result
in a considerable overestimate of prediction error.


On the other hand, with K = 5 say, cross-validation has lower variance. 
But bias could be a problem, depending on how the performance of the 
learning method varies with the size of the training set. Figure 7.8 shows 
a hypothetical “learning curve” for a classifier on a given task, a plot of 
$1 - \mathrm{Err}$ versus the size of the training set $N$. The performance of the
classifier improves as the training set size increases to 100 observations; 
increasing the number further to 200 brings only a small benefit. If our 
training set had 200 observations, fivefold cross-validation would estimate 
the performance of our classifier over training sets of size 160, which from 
Figure 7.8 is virtually the same as the performance for training set size 
200. Thus cross-validation would not suffer from much bias. However if the 
training set had 50 observations, fivefold cross-validation would estimate 
the performance of our classifier over training sets of size 40, and from the 
figure that would be an underestimate of $1 - \mathrm{Err}$. Hence as an estimate of
Err, cross-validation would be biased upward. 

To summarize, if the learning curve has a considerable slope at the given 
training set size, five- or tenfold cross-validation will overestimate the true 
prediction error. Whether this bias is a drawback in practice depends on 
the objective. On the other hand, leave-one-out cross-validation has low 
bias but can have high variance. Overall, five- or tenfold cross-validation 
are recommended as a good compromise: see Breiman and Spector (1992) 
and Kohavi (1995). 

Figure 7.9 shows the prediction error and tenfold cross-validation curve
estimated from a single training set, from the scenario in the bottom right
panel of Figure 7.3. This is a two-class classification problem, using a linear
model with best subsets regression of subset size $p$. Standard error bars
are shown, which are the standard errors of the individual misclassification
error rates for each of the ten parts. Both curves have minima at $p = 10$,
although the CV curve is rather flat beyond 10. Often a “one-standard
error” rule is used with cross-validation, in which we choose the most
parsimonious model whose error is no more than one standard error above
the error of the best model. Here it looks like a model with about $p = 9$
predictors would be chosen, while the true model uses $p = 10$.

FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve
(blue) estimated from a single training set, from the scenario in the bottom right
panel of Figure 7.3.
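
The one-standard-error rule described above is easy to state in code. The
following Python sketch assumes that a smaller subset size means a more
parsimonious model; the function name and the CV numbers are hypothetical
and are not read off Figure 7.9.

```python
def one_standard_error_choice(sizes, cv_means, cv_ses):
    """Smallest model size whose CV error is within one SE of the minimum."""
    best = min(range(len(sizes)), key=lambda j: cv_means[j])
    threshold = cv_means[best] + cv_ses[best]
    eligible = [s for s, m in zip(sizes, cv_means) if m <= threshold]
    return min(eligible)

# Hypothetical CV curve: mean error and standard error for each subset size.
sizes = [2, 4, 6, 8, 10, 12]
cv_means = [0.40, 0.33, 0.29, 0.262, 0.25, 0.255]
cv_ses = [0.03, 0.03, 0.02, 0.02, 0.02, 0.02]

chosen = one_standard_error_choice(sizes, cv_means, cv_ses)
```

In this made-up curve the minimum is at size 10, but size 8 lies within one
standard error of it, so the rule picks the smaller model.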

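Generalized cross-validation provides a convenient approximation to
leave-one-out cross-validation, for linear fitting under squared-error loss. As
defined in Section 7.6, a linear fitting method is one for which we can write

$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}. \tag{7.50}$$

Now for many linear fitting methods,

$$\frac{1}{N}\sum_{i=1}^{N}\left[y_i - \hat{f}^{-i}(x_i)\right]^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}\right]^2, \tag{7.51}$$

where $S_{ii}$ is the $i$th diagonal element of $\mathbf{S}$ (see Exercise 7.3). The GCV
approximation is

$$\mathrm{GCV}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(\mathbf{S})/N}\right]^2. \tag{7.52}$$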












The quantity trace(S) is the effective number of parameters, as defined in 
Section 7.6. 

GCV can have a computational advantage in some settings, where the
trace of $\mathbf{S}$ can be computed more easily than the individual elements $S_{ii}$.
In smoothing problems, GCV can also alleviate the tendency of
cross-validation to undersmooth. The similarity between GCV and AIC can be
seen from the approximation $1/(1 - x)^2 \approx 1 + 2x$ (Exercise 7.7).
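
To see the leave-one-out shortcut and the GCV approximation side by side,
the Python sketch below uses ridge regression without an intercept, a
linear fitting method for which the shortcut is exact. The simulated data,
the penalty value 2.0 and the helper name ridge_coef are assumptions made
for the example.

```python
import numpy as np

# Simulated regression data (hypothetical sizes and penalty).
rng = np.random.default_rng(0)
N, p, lam = 40, 6, 2.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

def ridge_coef(X, y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T y (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Brute force: refit N times, each time leaving out one observation.
loo_resid = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    loo_resid[i] = y[i] - X[i] @ ridge_coef(X[keep], y[keep], lam)
loocv_brute = np.mean(loo_resid ** 2)

# Shortcut: a single fit on all the data plus the diagonal of S.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - S @ y
loocv_shortcut = np.mean((resid / (1.0 - np.diag(S))) ** 2)

# GCV: replace each S_ii by the average trace(S) / N.
gcv = np.mean((resid / (1.0 - np.trace(S) / N)) ** 2)
```

The brute-force and shortcut values agree to numerical precision, while GCV
differs slightly because it replaces the individual leverages by their
average.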

7.10.2 The Wrong and Right Way to Do Cross-validation 

Consider a classification problem with a large number of predictors, as may 
arise, for example, in genomic or proteomic applications. A typical strategy 
for analysis might be as follows: 

1. Screen the predictors: find a subset of “good” predictors that show
fairly strong (univariate) correlation with the class labels.

2. Using just this subset of predictors, build a multivariate classifier. 

3. Use cross-validation to estimate the unknown tuning parameters and 
to estimate the prediction error of the final model. 

Is this a correct application of cross-validation? Consider a scenario with 
N = 50 samples in two equal-sized classes, and p = 5000 quantitative 
predictors (standard Gaussian) that are independent of the class labels. 
The true (test) error rate of any classifier is 50%. We carried out the above 
recipe, choosing in step (1) the 100 predictors having highest correlation
with the class labels, and then using a 1-nearest neighbor classifier, based
on just these 100 predictors, in step (2). Over 50 simulations from this 
setting, the average CV error rate was 3%. This is far lower than the true 
error rate of 50%. 

What has happened? The problem is that the predictors have an unfair
advantage, as they were chosen in step (1) on the basis of all of the samples.
Leaving samples out after the variables have been selected does not
correctly mimic the application of the classifier to a completely independent
test set, since these predictors “have already seen” the left out samples.

Figure 7.10 (top panel) illustrates the problem. We selected the 100
predictors having largest correlation with the class labels over all 50 samples.
Then we chose a random set of 10 samples, as we would do in five-fold
cross-validation, and computed the correlations of the pre-selected 100 predictors
with the class labels over just these 10 samples (top panel). We see that
the correlations average about 0.28, rather than 0, as one might expect.

Here is the correct way to carry out cross-validation in this example:

1. Divide the samples into K cross-validation folds (groups) at random.

2. For each fold k = 1, 2, ..., K

(a) Find a subset of “good” predictors that show fairly strong
(univariate) correlation with the class labels, using all of the samples
except those in fold k.

(b) Using just this subset of predictors, build a multivariate
classifier, using all of the samples except those in fold k.

(c) Use the classifier to predict the class labels for the samples in
fold k.

FIGURE 7.10. Cross-validation the wrong and right way: histograms show the
correlation of class labels, in 10 randomly chosen samples, with the 100
predictors chosen using the incorrect (upper red, panel “Wrong way”) and correct
(lower green, panel “Right way”) versions of cross-validation. The horizontal
axis of both panels is “Correlations of Selected Predictors with Outcome”.

The error estimates from step 2(c) are then accumulated over all K folds, to 
produce the cross-validation estimate of prediction error. The lower panel 
of Figure 7.10 shows the correlations of class labels with the 100 predictors 
chosen in step 2(a) of the correct procedure, over the samples in a typical 
fold k. We see that they average about zero, as they should. 
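
The contrast between the two procedures can be sketched in a small
simulation. This is an illustrative sketch under stated assumptions, not the
code behind Figure 7.10: the problem is shrunk (N = 50, p = 500, 20 selected
predictors) so that it runs quickly in plain Python, a nearest-centroid rule
stands in for the multivariate classifier, and all function names are
hypothetical.

```python
import random


def correlation(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def select_predictors(X, y, rows, m):
    """Indices of the m predictors with largest |correlation| with y over `rows`."""
    labels = [y[i] for i in rows]

    def score(j):
        return abs(correlation([X[i][j] for i in rows], labels))

    return sorted(range(len(X[0])), key=score, reverse=True)[:m]


def predict(X, y, train, cols, x_new):
    """Nearest-centroid rule on the predictors in `cols` (assumes both classes occur in `train`)."""
    dists = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in cols]
        dists[label] = sum((x_new[j] - c) ** 2 for j, c in zip(cols, centroid))
    return 0 if dists[0] <= dists[1] else 1


def cv_error(X, y, K, m, select_inside_folds, rng):
    """K-fold CV error rate; selection is redone in every fold only if select_inside_folds."""
    n = len(y)
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[k::K] for k in range(K)]
    # Wrong way: the selection step sees the labels of all samples, once, up front.
    cols_all = None if select_inside_folds else select_predictors(X, y, list(range(n)), m)
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        # Right way: step 2(a), selection uses only the samples outside fold k.
        cols = select_predictors(X, y, train, m) if select_inside_folds else cols_all
        errors += sum(predict(X, y, train, cols, X[i]) != y[i] for i in fold)
    return errors / n


def simulate(n=50, p=500, m=20, K=5, reps=5, seed=0):
    """Average wrong-way and right-way CV error over `reps` pure-noise datasets."""
    rng = random.Random(seed)
    wrong = right = 0.0
    for _ in range(reps):
        y = [i % 2 for i in range(n)]  # two equal-sized classes
        X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]  # independent of y
        wrong += cv_error(X, y, K, m, False, rng) / reps
        right += cv_error(X, y, K, m, True, rng) / reps
    return wrong, right


wrong_way, right_way = simulate()
print("wrong way:", wrong_way, "right way:", right_way)
```

Because the predictors are generated independently of the labels, the true
error rate is 50%; one would expect the right-way estimate to be near that
value and the wrong-way estimate to fall well below it.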

In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied. There
is one qualification: initial unsupervised screening steps can be done
before samples are left out. For example, we could select the 1000 predictors
with highest variance across all 50 samples, before starting cross-validation.
Since this filtering does not involve the class labels, it does not give the
predictors an unfair advantage.

While this point may seem obvious to the reader, we have seen this
blunder committed many times in published papers in top-ranked journals.
With the large numbers of predictors that are so common in genomics and
other areas, the potential consequences of this error have also increased
dramatically; see Ambroise and McLachlan (2002) for a detailed discussion
of this issue.


7.10.3 Does Cross-Validation Really Work? 

We once again examine the behavior of cross-validation in a high-dimensional
classification problem. Consider a scenario with N = 20 samples in two
equal-sized classes, and p = 500 quantitative predictors that are
independent of the class labels. Once again, the true error rate of any
classifier is 50%. Consider a simple univariate classifier: a single split
that minimizes the misclassification error (a “stump”). Stumps are trees with
a single split, and are used in boosting methods (Chapter 10). A simple
argument suggests that cross-validation will not work properly in this
setting [2]:

Fitting to the entire training set, we will find a predictor that
splits the data very well. If we do 5-fold cross-validation, this
same predictor should split any 4/5ths and 1/5th of the data
well too, and hence its cross-validation error will be small (much
less than 50%). Thus CV does not give an accurate estimate of
error.

To investigate whether this argument is correct, Figure 7.11 shows the
result of a simulation from this setting. There are 500 predictors and 20
samples, split into two equal-sized classes, with all predictors having a
standard Gaussian distribution. The panel in the top left shows the number
of training errors for each of the 500 stumps fit to the training data. We
have marked in color the six predictors yielding the fewest errors. In the top
right panel, the errors are shown for stumps fit to a random 4/5ths
partition of the data (16 samples) and tested on the remaining 1/5th (four
samples). The colored points indicate the same predictors marked in the
top left panel. We see that the stump for the blue predictor (whose stump
was the best in the top left panel) makes two out of four test errors (50%),
and is no better than random.

[2] This argument was made to us by a scientist at a proteomics lab meeting,
and led to material in this section.

What has happened? The preceding argument has ignored the fact that
in cross-validation, the model must be completely retrained for each fold
of the process. In the present example, this means that the best predictor
and corresponding split point are found from 4/5ths of the data. The effect
of predictor choice is seen in the top right panel. Since the class labels are
independent of the predictors, the performance of a stump on the 4/5ths
training data contains no information about its performance in the
remaining 1/5th. The effect of the choice of split point is shown in the
bottom left panel. Here we see the data for predictor 436, corresponding to
the blue dot in the top left plot. The colored points indicate the 1/5th data,
while the remaining points belong to the 4/5ths. The optimal split points for
this predictor based on both the full training set and 4/5ths data are
indicated. The split based on the full data makes no errors on the 1/5th
data. But cross-validation must base its split on the 4/5ths data, and this
incurs two errors out of four samples.

FIGURE 7.11. Simulation study to investigate the performance of
cross-validation in a high-dimensional problem where the predictors are
independent of the class labels. The top-left panel shows the number of errors
made by individual stump classifiers on the full training set (20
observations). The top right panel shows the errors made by individual stumps
trained on a random split of the dataset into 4/5ths (16 observations) and
tested on the remaining 1/5th (4 observations). The best performers are
depicted by colored dots in each panel. The bottom left panel (“Predictor 436
(blue)”) shows the effect of re-estimating the split point in each fold: the
colored points correspond to the four samples in the 1/5th validation set. The
split point derived from the full dataset classifies all four samples
correctly, but when the split point is re-estimated on the 4/5ths data (as it
should be), it commits two errors on the four validation samples. In the
bottom right (“CV Errors”) we see the overall result of five-fold
cross-validation applied to 50 simulated datasets. The average error rate is
about 50%, as it should be.

The results of applying five-fold cross-validation to each of 50 simulated
datasets are shown in the bottom right panel. As we would hope, the average
cross-validation error is around 50%, which is the true expected prediction
error for this classifier. Hence cross-validation has behaved as it should.
On the other hand, there is considerable variability in the error,
underscoring the importance of reporting the estimated standard error of the
CV estimate. See Exercise 7.10 for another variation of this problem.


7.11 Bootstrap Methods 


The bootstrap is a general tool for assessing statistical accuracy. First we
describe the bootstrap in general, and then show how it can be used to
estimate extra-sample prediction error. As with cross-validation, the
bootstrap seeks to estimate the conditional error $\mathrm{Err}_{\mathcal{T}}$,
but typically estimates well only the expected prediction error Err.

Suppose we have a model fit to a set of training data. We denote the
training set by $\mathbf{Z} = (z_1, z_2, \ldots, z_N)$ where
$z_i = (x_i, y_i)$. The basic idea is to randomly draw datasets with
replacement from the training data, each sample the same size as the original
training set. This is done $B$ times ($B = 100$, say), producing $B$ bootstrap
datasets, as shown in Figure 7.12. Then we refit the model to each of the
bootstrap datasets, and examine the behavior of the fits over the $B$
replications.

FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the
statistical accuracy of a quantity $S(\mathbf{Z})$ computed from our dataset.
$B$ training sets $\mathbf{Z}^{*b}$, $b = 1, \ldots, B$, each of size $N$, are
drawn with replacement from the original dataset. The quantity of interest
$S(\mathbf{Z})$ is computed from each bootstrap training set, and the values
$S(\mathbf{Z}^{*1}), \ldots, S(\mathbf{Z}^{*B})$ are used to assess the
statistical accuracy of $S(\mathbf{Z})$.

In the figure, $S(\mathbf{Z})$ is any quantity computed from the data
$\mathbf{Z}$, for example, the prediction at some input point. From the
bootstrap sampling we can estimate any aspect of the distribution of
$S(\mathbf{Z})$, for example, its variance,

$$\widehat{\mathrm{Var}}[S(\mathbf{Z})] = \frac{1}{B-1}\sum_{b=1}^{B}\left(S(\mathbf{Z}^{*b}) - \bar{S}^*\right)^2, \tag{7.53}$$

where $\bar{S}^* = \sum_b S(\mathbf{Z}^{*b})/B$. Note that
$\widehat{\mathrm{Var}}[S(\mathbf{Z})]$ can be thought of as a Monte-Carlo
estimate of the variance of $S(\mathbf{Z})$ under sampling from the empirical
distribution function $\hat{F}$ for the data $(z_1, z_2, \ldots, z_N)$.
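
The bootstrap variance estimate can be sketched in a few lines. This is a
minimal illustration, assuming the quantity of interest is supplied as an
ordinary Python function; the names `bootstrap_variance` and `mean` are
hypothetical, and the sample mean is used only because its variance under the
empirical distribution is known in closed form for comparison.

```python
import random


def bootstrap_variance(data, statistic, B=200, seed=0):
    """Monte Carlo variance of statistic(data) over B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        # One bootstrap dataset: n draws with replacement from the data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    s_bar = sum(replicates) / B
    return sum((s - s_bar) ** 2 for s in replicates) / (B - 1)


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
var_boot = bootstrap_variance(data, mean, B=2000)

# Exact variance of the mean under the empirical distribution, for comparison.
m = mean(data)
var_plug_in = sum((v - m) ** 2 for v in data) / len(data) ** 2
print(var_boot, var_plug_in)
```

For the sample mean the two numbers should agree up to Monte Carlo error; for
statistics without a closed-form variance only the bootstrap value is
available.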

How can we apply the bootstrap to estimate prediction error? One
approach would be to fit the model in question on a set of bootstrap samples,
and then keep track of how well it predicts the original training set. If
$\hat{f}^{*b}(x_i)$ is the predicted value at $x_i$, from the model fitted to
the $b$th bootstrap dataset, our estimate is

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)). \tag{7.54}$$

However, it is easy to see that $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ does
not provide a good estimate in general. The reason is that the bootstrap
datasets are acting as the training samples, while the original training set
is acting as the test sample, and these two samples have observations in
common. This overlap can make overfit predictions look unrealistically good,
and is the reason that cross-validation explicitly uses non-overlapping data
for the training and test samples. Consider for example a 1-nearest neighbor
classifier applied to a two-class classification problem with the same number
of observations in each class, in which the predictors and class labels are
in fact independent. Then the true error rate is 0.5. But the contributions
to the bootstrap estimate $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ will be zero
unless the observation $i$ does not appear in the bootstrap sample $b$. In
this latter case it will have the correct expectation 0.5. Now

$$\Pr\{\text{observation } i \in \text{bootstrap sample } b\} = 1 - \left(1 - \frac{1}{N}\right)^N \approx 1 - e^{-1} = 0.632. \tag{7.55}$$

Hence the expectation of $\widehat{\mathrm{Err}}_{\mathrm{boot}}$ is about
$0.5 \times 0.368 = 0.184$, far below the correct error rate 0.5.
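
The 1-nearest-neighbor argument can be checked numerically. The sketch below
is illustrative only: it assumes one-dimensional Gaussian inputs, N = 40
observations and B = 100 bootstrap samples (choices made here, not taken from
the text), and the function names are hypothetical.

```python
import random


def inclusion_probability(n):
    """Probability that a given observation appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def err_boot_1nn(x, y, B, rng):
    """Bootstrap error estimate for 1-NN: train on each resample, test on all original points."""
    n = len(x)
    mistakes = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # indices of one bootstrap sample
        for i in range(n):
            # Nearest training point; if i itself was drawn, the distance is zero.
            nearest = min(idx, key=lambda j: abs(x[j] - x[i]))
            mistakes += y[nearest] != y[i]
    return mistakes / (B * n)


rng = random.Random(1)
n, B, reps = 40, 100, 5
estimates = []
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # inputs independent of the labels
    y = [i % 2 for i in range(n)]                # two equal-sized classes
    estimates.append(err_boot_1nn(x, y, B, rng))
avg_err_boot = sum(estimates) / reps
print(inclusion_probability(n), avg_err_boot)
```

The averaged estimate should land far below the true error rate of 0.5,
roughly in the neighborhood of the value 0.184 derived above.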

By mimicking cross-validation, a better bootstrap estimate can be
obtained. For each observation, we only keep track of predictions from
bootstrap samples not containing that observation. The leave-one-out
bootstrap estimate of prediction error is defined by

$$\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i)). \tag{7.56}$$

Here $C^{-i}$ is the set of indices of the bootstrap samples $b$ that do not
contain observation $i$, and $|C^{-i}|$ is the number of such samples. In
computing $\widehat{\mathrm{Err}}^{(1)}$, we either have to choose $B$ large
enough to ensure that all of the $|C^{-i}|$ are greater than zero, or we can
just leave out the terms in (7.56) corresponding to $|C^{-i}|$’s that are zero.

The leave-one-out bootstrap solves the overfitting problem suffered by
$\widehat{\mathrm{Err}}_{\mathrm{boot}}$, but has the training-set-size bias
mentioned in the discussion of cross-validation. The average number of
distinct observations in each bootstrap sample is about $0.632 \cdot N$, so
its bias will roughly behave like that of twofold cross-validation. Thus if
the learning curve has considerable slope at sample size $N/2$, the
leave-one-out bootstrap will be biased upward as an estimate of the true
error.

The “.632 estimator” is designed to alleviate this bias. It is defined by

$$\widehat{\mathrm{Err}}^{(.632)} = .368 \cdot \overline{\mathrm{err}} + .632 \cdot \widehat{\mathrm{Err}}^{(1)}. \tag{7.57}$$

The derivation of the .632 estimator is complex; intuitively it pulls the
leave-one-out bootstrap estimate down toward the training error rate, and
hence reduces its upward bias. The use of the constant .632 relates to (7.55).

The .632 estimator works well in “light fitting” situations, but can break
down in overfit ones. Here is an example due to Breiman et al. (1984).
Suppose we have two equal-size classes, with the inputs independent of
the class labels, and we apply a one-nearest neighbor rule. Then
$\overline{\mathrm{err}} = 0$, $\widehat{\mathrm{Err}}^{(1)} = 0.5$ and so
$\widehat{\mathrm{Err}}^{(.632)} = .632 \times 0.5 = .316$. However, the true
error rate is 0.5.

One can improve the .632 estimator by taking into account the amount
of overfitting. First we define $\gamma$ to be the no-information error rate:
this is the error rate of our prediction rule if the inputs and class labels
were independent. An estimate of $\gamma$ is obtained by evaluating the
prediction rule on all possible combinations of targets $y_i$ and predictors
$x_{i'}$:

$$\hat{\gamma} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} L(y_i, \hat{f}(x_{i'})). \tag{7.58}$$

For example, consider the dichotomous classification problem: let
$\hat{p}_1$ be the observed proportion of responses $y_i$ equaling 1, and let
$\hat{q}_1$ be the observed proportion of predictions $\hat{f}(x_{i'})$
equaling 1. Then

$$\hat{\gamma} = \hat{p}_1(1 - \hat{q}_1) + (1 - \hat{p}_1)\hat{q}_1. \tag{7.59}$$

With a rule like 1-nearest neighbors for which $\hat{q}_1 = \hat{p}_1$ the
value of $\hat{\gamma}$ is $2\hat{p}_1(1 - \hat{p}_1)$. The multi-category
generalization of (7.59) is
$\hat{\gamma} = \sum_{\ell} \hat{p}_\ell(1 - \hat{q}_\ell)$.
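Using this, the relative overfitting rate is defined to be

$$\hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}, \tag{7.60}$$

a quantity that ranges from 0 if there is no overfitting
($\widehat{\mathrm{Err}}^{(1)} = \overline{\mathrm{err}}$) to 1 if the
overfitting equals the no-information value
$\hat{\gamma} - \overline{\mathrm{err}}$. Finally, we define the “.632+”
estimator by

$$\widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat{w}) \cdot \overline{\mathrm{err}} + \hat{w} \cdot \widehat{\mathrm{Err}}^{(1)} \quad \text{with} \quad \hat{w} = \frac{.632}{1 - .368\hat{R}}. \tag{7.61}$$

The weight $\hat{w}$ ranges from .632 if $\hat{R} = 0$ to 1 if $\hat{R} = 1$,
so $\widehat{\mathrm{Err}}^{(.632+)}$ ranges from
$\widehat{\mathrm{Err}}^{(.632)}$ to $\widehat{\mathrm{Err}}^{(1)}$. Again, the
derivation of (7.61) is complicated: roughly speaking, it produces a
compromise between the leave-one-out bootstrap and the training error rate
that depends on the amount of overfitting. For the 1-nearest-neighbor problem
with class labels independent of the inputs, $\hat{w} = \hat{R} = 1$, so
$\widehat{\mathrm{Err}}^{(.632+)} = \widehat{\mathrm{Err}}^{(1)}$, which has the
correct expectation of 0.5. In other problems with less overfitting,
$\widehat{\mathrm{Err}}^{(.632+)}$ will lie somewhere between
$\overline{\mathrm{err}}$ and $\widehat{\mathrm{Err}}^{(1)}$.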
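
The arithmetic of (7.57) to (7.61) can be written out directly. This is a
minimal sketch for 0-1 loss with hypothetical function names; the training
error and leave-one-out bootstrap error are taken as given inputs. The guard
that sets the relative overfitting rate to zero when the denominator is not
positive is an assumption added here to avoid division by zero, not something
stated in the text.

```python
def no_information_rate(y, predictions):
    """Estimate of gamma for 0-1 loss: average loss over all N^2 (target, prediction) pairs."""
    n = len(y)
    return sum(yi != pj for yi in y for pj in predictions) / n ** 2


def err_632(err_train, err_loo):
    """The .632 estimator (7.57)."""
    return 0.368 * err_train + 0.632 * err_loo


def err_632_plus(err_train, err_loo, gamma):
    """The .632+ estimator (7.61), using the relative overfitting rate (7.60)."""
    if gamma <= err_train:
        rate = 0.0  # assumption made for this sketch only: no overfitting measurable
    else:
        rate = (err_loo - err_train) / (gamma - err_train)
    weight = 0.632 / (1.0 - 0.368 * rate)
    return (1.0 - weight) * err_train + weight * err_loo


# 1-nearest-neighbor example: training predictions reproduce the labels exactly.
labels = [0, 1] * 10
gamma_hat = no_information_rate(labels, labels)
print(gamma_hat, err_632(0.0, 0.5), err_632_plus(0.0, 0.5, gamma_hat))
```

With training error 0 and leave-one-out bootstrap error 0.5, the .632
estimator gives .316 while the .632+ estimator moves all the way to the
leave-one-out value, matching the discussion above.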


7.11.1 Example (Continued) 

Figure 7.13 shows the results of tenfold cross-validation and the .632+
bootstrap estimate in the same four problems of Figure 7.7. As in that figure,
Figure 7.13 shows boxplots of
$100 \cdot [\mathrm{Err}_{\hat{\alpha}} - \min_\alpha \mathrm{Err}(\alpha)]/[\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)]$,
the error in using the chosen model relative to the best model. There are 100
different training sets represented in each boxplot. Both measures perform
well overall, perhaps the same or slightly worse than the AIC in Figure 7.7.

FIGURE 7.13. Boxplots show the distribution of the relative error
$100 \cdot [\mathrm{Err}_{\hat{\alpha}} - \min_\alpha \mathrm{Err}(\alpha)]/[\max_\alpha \mathrm{Err}(\alpha) - \min_\alpha \mathrm{Err}(\alpha)]$
over the four scenarios of Figure 7.3 (reg/KNN, reg/linear, class/KNN,
class/linear). This is the error in using the chosen model relative to the
best model. There are 100 training sets represented in each boxplot.
[Panel title recovered from the figure: Cross-validation.]

Our conclusion is that for these particular problems and fitting methods,
minimization of either AIC, cross-validation or bootstrap yields a model
fairly close to the best available. Note that for the purpose of model
selection, any of the measures could be biased and it wouldn’t affect things,
as long as the bias did not change the relative performance of the methods.
For example, the addition of a constant to any of the measures would not
change the resulting chosen model. However, for many adaptive, nonlinear
techniques (like trees), estimation of the effective number of parameters is
very difficult. This makes methods like AIC impractical and leaves us with
cross-validation or bootstrap as the methods of choice.

A different question is: how well does each method estimate test error?
On the average the AIC criterion overestimated prediction error of its
chosen model by 38%, 37%, 51%, and 30%, respectively, over the four scenarios,
with BIC performing similarly. In contrast, cross-validation overestimated
the error by 1%, 4%, 0%, and 4%, with the bootstrap doing about the
same. Hence the extra work involved in computing a cross-validation or
bootstrap measure is worthwhile, if an accurate estimate of test error is
required. With other fitting methods like trees, cross-validation and
bootstrap can underestimate the true error by 10%, because the search for best
tree is strongly affected by the validation set. In these situations only a
separate test set will provide an unbiased estimate of test error.


7.12 Conditional or Expected Test Error? 

Figures 7.14 and 7.15 examine the question of whether cross-validation does
a good job in estimating $\mathrm{Err}_{\mathcal{T}}$, the error conditional
on a given training set $\mathcal{T}$ (expression (7.15) on page 228), as
opposed to the expected test error. For each of 100 training sets generated
from the “reg/linear” setting in the top-right panel of Figure 7.3, Figure
7.14 shows the conditional error curves $\mathrm{Err}_{\mathcal{T}}$ as a
function of subset size (top left). The next two panels show 10-fold and
$N$-fold cross-validation, the latter also known as leave-one-out (LOO). The
thick red curve in each plot is the expected error Err, while the thick black
curves are the expected cross-validation curves. The lower right panel shows
how well cross-validation approximates the conditional and expected error.

One might have expected $N$-fold CV to approximate
$\mathrm{Err}_{\mathcal{T}}$ well, since it almost uses the full training
sample to fit a new test point. 10-fold CV, on the other hand, might be
expected to estimate Err well, since it averages over somewhat different
training sets. From the figure it appears 10-fold does a better job than
$N$-fold in estimating $\mathrm{Err}_{\mathcal{T}}$, and estimates Err even
better. Indeed, the similarity of the two black curves with the red curve
suggests both CV curves are approximately unbiased for Err, with 10-fold
having less variance. Similar trends were reported by Efron (1983).

Figure 7.15 shows scatterplots of both 10-fold and $N$-fold cross-validation
error estimates versus the true conditional error for the 100 simulations.
Although the scatterplots do not indicate much correlation, the lower right
panel shows that for the most part the correlations are negative, a
curious phenomenon that has been observed before. This negative correlation
explains why neither form of CV estimates $\mathrm{Err}_{\mathcal{T}}$ well.
The broken lines in each plot are drawn at Err(p), the expected error for the
best subset of size p. We see again that both forms of CV are approximately
unbiased for expected error, but the variation in test error for different
training sets is quite substantial.

Among the four experimental conditions in Figure 7.3, this “reg/linear”
scenario showed the highest correlation between actual and predicted test
error. This phenomenon also occurs for bootstrap estimates of error, and we
would guess, for any other estimate of conditional prediction error.

FIGURE 7.14. Conditional prediction-error $\mathrm{Err}_{\mathcal{T}}$,
10-fold cross-validation, and leave-one-out cross-validation curves for 100
simulations from the top-right panel in Figure 7.3. The thick red curve is the
expected prediction error Err, while the thick black curves are the expected
CV curves $E_{\mathcal{T}}\mathrm{CV}_{10}$ and $E_{\mathcal{T}}\mathrm{CV}_N$.
The lower-right panel shows the mean absolute deviation of the CV curves from
the conditional error,
$E_{\mathcal{T}}|\mathrm{CV}_K - \mathrm{Err}_{\mathcal{T}}|$ for $K = 10$
(blue) and $K = N$ (green), as well as from the expected error
$E_{\mathcal{T}}|\mathrm{CV}_{10} - \mathrm{Err}|$ (orange).

FIGURE 7.15. Plots of the CV estimates of error versus the true conditional
error for each of the 100 training sets, for the simulation setup in the top
right panel of Figure 7.3. Both 10-fold and leave-one-out CV are depicted in
different colors. The first three panels correspond to different subset sizes
p, and vertical and horizontal lines are drawn at Err(p). Although there
appears to be little correlation in these plots, we see in the lower right
panel that for the most part the correlation is negative.

We conclude that estimation of test error for a particular training set is
not easy in general, given just the data from that same training set. Instead,
cross-validation and related methods may provide reasonable estimates of
the expected error Err.

Key references for cross-validation are Stone (1974), Stone (1977) and
Allen (1974). The AIC was proposed by Akaike (1973), while the BIC
was introduced by Schwarz (1978). Madigan and Raftery (1994) give an
[...] (1977) showed that the AIC and leave-one-out cross-validation are
asymp[...] in the monograph by Wahba (1990). See also Hastie and Tibshirani
(1990), Chapter 3. The bootstrap is due to Efron (1979); see Efron and
Tibshirani (1993) for an overview. Efron (1983) proposes a number of bootstrap
estimates of prediction error, including the optimism and .632 estimates.
Efron (1986) compares CV, GCV and bootstrap estimates of error rates.
The use of cross-validation and the bootstrap for model selection is
studied by Breiman and Spector (1992), Breiman (1992), Shao (1996), Zhang
(1993) and Kohavi (1995). The .632+ estimator was proposed by Efron
and Tibshirani (1997).

Cherkassky and Ma (2003) published a study on the performance of
SRM for model selection in regression, in response to our study of Section
7.9.1. They complained that we had been unfair to SRM because we had not
applied it properly. Our response can be found in the same issue of the
journal (Hastie et al. (2003)).


Exercises 

Ex. 7.1 Derive the estimate of in-sample error (7.24). 

Ex. 7.2 For 0-1 loss with $Y \in \{0, 1\}$ and $\Pr(Y = 1|x_0) = f(x_0)$,
show that

$$\begin{aligned}
\mathrm{Err}(x_0) &= \Pr(Y \neq \hat{G}(x_0)|X = x_0) \\
&= \mathrm{Err}_B(x_0) + |2f(x_0) - 1|\,\Pr(\hat{G}(x_0) \neq G(x_0)|X = x_0),
\end{aligned} \tag{7.62}$$

where $\hat{G}(x) = I(\hat{f}(x) > \frac{1}{2})$,
$G(x) = I(f(x) > \frac{1}{2})$ is the Bayes classifier, and
$\mathrm{Err}_B(x_0) = \Pr(Y \neq G(x_0)|X = x_0)$, the irreducible Bayes error
at $x_0$. Using the approximation
$\hat{f}(x_0) \sim N(E\hat{f}(x_0), \mathrm{Var}(\hat{f}(x_0)))$, show that

$$\Pr(\hat{G}(x_0) \neq G(x_0)|X = x_0) \approx \Phi\left(\frac{\mathrm{sign}(\frac{1}{2} - f(x_0))(E\hat{f}(x_0) - \frac{1}{2})}{\sqrt{\mathrm{Var}(\hat{f}(x_0))}}\right). \tag{7.63}$$

In the above,

$$\Phi(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} \exp(-s^2/2)\,ds,$$

the cumulative Gaussian distribution function. This is an increasing
function, with value 0 at $t = -\infty$ and value 1 at $t = +\infty$.

We can think of $\mathrm{sign}(\frac{1}{2} - f(x_0))(E\hat{f}(x_0) - \frac{1}{2})$
as a kind of boundary-bias term, as it depends on the true $f(x_0)$ only
through which side of the boundary ($\frac{1}{2}$) it lies on. Notice also
that the bias and variance combine in a multiplicative rather than additive
fashion. If $E\hat{f}(x_0)$ is on the same side of $\frac{1}{2}$ as $f(x_0)$,
then the bias is negative, and decreasing the variance will decrease the
misclassification error. On the other hand, if $E\hat{f}(x_0)$ is on the
opposite side of $\frac{1}{2}$ to $f(x_0)$, then the bias is positive and it
pays to increase the variance! Such an increase will improve the chance that
$\hat{f}(x_0)$ falls on the correct side of $\frac{1}{2}$ (Friedman, 1997).

Ex. 7.3 Let $\hat{\mathbf{f}} = \mathbf{S}\mathbf{y}$ be a linear smoothing of
$\mathbf{y}$.

(a) If $S_{ii}$ is the $i$th diagonal element of $\mathbf{S}$, show that for
$\mathbf{S}$ arising from least squares projections and cubic smoothing
splines, the cross-validated residual can be written as

$$y_i - \hat{f}^{-i}(x_i) = \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}. \tag{7.64}$$

(b) Use this result to show that
$|y_i - \hat{f}^{-i}(x_i)| \geq |y_i - \hat{f}(x_i)|$.

(c) Find general conditions on any smoother $\mathbf{S}$ to make result (7.64)
hold.

Ex. 7.4 Consider the in-sample prediction error (7.18) and the training
error $\overline{\mathrm{err}}$ in the case of squared-error loss:

$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^0}(Y_i^0 - \hat{f}(x_i))^2, \qquad \overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2.$$

Add and subtract $f(x_i)$ and $E\hat{f}(x_i)$ in each expression and expand.
Hence establish that the average optimism in the training error is

$$\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i),$$

as given in (7.21).

Ex. 7.5 For a linear smoother $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$, show
that

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = \mathrm{trace}(\mathbf{S})\sigma_\varepsilon^2, \tag{7.65}$$

which justifies its use as the effective number of parameters.

Ex. 7.6 Show that for an additive-error model, the effective
degrees-of-freedom for the $k$-nearest-neighbors regression fit is $N/k$.

Ex. 7.7 Use the approximation $1/(1 - x)^2 \approx 1 + 2x$ to expose the
relationship between $C_p$/AIC (7.26) and GCV (7.52), the main difference
being the model used to estimate the noise variance $\sigma_\varepsilon^2$.

Ex. 7.8 Show that the set of functions $\{I(\sin(\alpha x) > 0)\}$ can shatter
the following points on the line:

$$z^1 = 10^{-1}, \ldots, z^\ell = 10^{-\ell}, \tag{7.66}$$

for any $\ell$. Hence the VC dimension of the class
$\{I(\sin(\alpha x) > 0)\}$ is infinite.

Ex. 7.9 For the prostate data of Chapter 3, carry out a best-subset linear 
regression analysis, as in Table 3.3 (third column from left). Compute the 
AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates 
of prediction error. Discuss the results. 

Ex. 7.10 Referring to the example in Section 7.10.3, suppose instead that 
all of the p predictors are binary, and hence there is no need to estimate 
split points. The predictors are independent of the class labels as before. 
Then if p is very large, we can probably find a predictor that splits the 
entire training data perfectly, and hence would split the validation data 
(one-fifth of data) perfectly as well. This predictor would therefore have 
zero cross-validation error. Does this mean that cross-validation does not 
provide a good estimate of test error in this situation? [This question was 
suggested by Li Ma.] 
Model Inference and Averaging




8.1 Introduction 

For most of this book, the fitting (learning) of models has been achieved by 
minimizing a sum of squares for regression, or by minimizing cross-entropy 
for classification. In fact, both of these minimizations are instances of the 
maximum likelihood approach to fitting. 

In this chapter we provide a general exposition of the maximum
likelihood approach, as well as the Bayesian method for inference. The
bootstrap, introduced in Chapter 7, is discussed in this context, and its
relation to maximum likelihood and Bayes is described. Finally, we present
some related techniques for model averaging and improvement, including
committee methods, bagging, stacking and bumping.


8.2 The Bootstrap and Maximum Likelihood Methods

8.2.1 A Smoothing Example 

The bootstrap method provides a direct computational way of assessing
uncertainty, by sampling from the training data. Here we illustrate the
bootstrap in a simple one-dimensional smoothing problem, and show its
connection to maximum likelihood.

FIGURE 8.1. (Left panel): Data for smoothing example. (Right panel): Set of
seven B-spline basis functions. The broken vertical lines indicate the
placement of the three knots. Horizontal axis in both panels: x.

Denote the training data by $\mathbf{Z} = \{z_1, z_2, \ldots, z_N\}$, with
$z_i = (x_i, y_i)$, $i = 1, 2, \ldots, N$. Here $x_i$ is a one-dimensional
input, and $y_i$ the outcome, either continuous or categorical. As an example,
consider the $N = 50$ data points shown in the left panel of Figure 8.1.

Suppose we decide to fit a cubic spline to the data, with three knots
placed at the quartiles of the $X$ values. This is a seven-dimensional
linear space of functions, and can be represented, for example, by a linear
expansion of B-spline basis functions (see Section 5.9.2):

$$\mu(x) = \sum_{j=1}^{7} \beta_j h_j(x). \tag{8.1}$$

Here the $h_j(x)$, $j = 1, 2, \ldots, 7$ are the seven functions shown in the
right panel of Figure 8.1. We can think of $\mu(x)$ as representing the
conditional mean $E(Y|X = x)$.

Let $\mathbf{H}$ be the $N \times 7$ matrix with $ij$th element $h_j(x_i)$.
The usual estimate of $\beta$, obtained by minimizing the squared error over
the training set, is given by

$$\hat{\beta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}. \tag{8.2}$$

The corresponding fit $\hat{\mu}(x) = \sum_{j=1}^{7} \hat{\beta}_j h_j(x)$ is
shown in the top left panel of Figure 8.2.

The estimated covariance matrix of $\hat{\beta}$ is

$$\widehat{\mathrm{Var}}(\hat{\beta}) = (\mathbf{H}^T\mathbf{H})^{-1}\hat{\sigma}^2, \tag{8.3}$$

where we have estimated the noise variance by
$\hat{\sigma}^2 = \sum_{i=1}^{N} (y_i - \hat{\mu}(x_i))^2/N$. Letting
$h(x)^T = (h_1(x), h_2(x), \ldots, h_7(x))$, the standard error of a
prediction $\hat{\mu}(x) = h(x)^T\hat{\beta}$ is

$$\widehat{\mathrm{se}}[\hat{\mu}(x)] = [h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}h(x)]^{\frac{1}{2}}\hat{\sigma}. \tag{8.4}$$

In the top right panel of Figure 8.2 we have plotted
$\hat{\mu}(x) \pm 1.96 \cdot \widehat{\mathrm{se}}[\hat{\mu}(x)]$. Since 1.96
is the 97.5% point of the standard normal distribution, these represent
approximate $100 - 2 \times 2.5\% = 95\%$ pointwise confidence bands for
$\mu(x)$.

FIGURE 8.2. (Top left): B-spline smooth of data. (Top right): B-spline smooth
plus and minus 1.96 × standard error bands. (Bottom left): Ten bootstrap
replicates of the B-spline smooth. (Bottom right): B-spline smooth with 95%
standard error bands computed from the bootstrap distribution.

Here is how we could apply the bootstrap in this example. We draw $B$
datasets each of size $N = 50$ with replacement from our training data, the
sampling unit being the pair $z_i = (x_i, y_i)$. To each bootstrap dataset
$\mathbf{Z}^*$ we fit a cubic spline $\hat{\mu}^*(x)$; the fits from ten such
samples are shown in the bottom left panel of Figure 8.2. Using $B = 200$
bootstrap samples, we can form a 95% pointwise confidence band from the
percentiles at each $x$: we find the $2.5\% \times 200 =$ fifth largest and
smallest values at each $x$. These are plotted in the bottom right panel of
Figure 8.2. The bands look similar to those in the top right, being a little
wider at the endpoints.

There is actually a close connection between the least squares estimates
(8.2) and (8.3), the bootstrap, and maximum likelihood. Suppose we further
assume that the model errors are Gaussian,

$$Y = \mu(X) + \varepsilon; \quad \varepsilon \sim N(0, \sigma^2), \qquad \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x). \tag{8.5}$$

The bootstrap method described above, in which we sample with
replacement from the training data, is called the nonparametric bootstrap.
This really means that the method is “model-free,” since it uses the raw
data, not a specific parametric model, to generate new datasets. Consider
a variation of the bootstrap, called the parametric bootstrap, in which we
simulate new responses by adding Gaussian noise to the predicted values:

$$y_i^* = \hat{\mu}(x_i) + \varepsilon_i^*; \quad \varepsilon_i^* \sim N(0, \hat{\sigma}^2); \quad i = 1, 2, \ldots, N. \tag{8.6}$$

This process is repeated $B$ times, where $B = 200$ say. The resulting
bootstrap datasets have the form $(x_1, y_1^*), \ldots, (x_N, y_N^*)$ and we
recompute the B-spline smooth on each. The confidence bands from this method
will exactly equal the least squares bands in the top right panel, as the
number of bootstrap samples goes to infinity. A function estimated from a
bootstrap sample $\mathbf{y}^*$ is given by
$\hat{\mu}^*(x) = h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}^*$,
and has distribution

$$\hat{\mu}^*(x) \sim N(\hat{\mu}(x), h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}h(x)\hat{\sigma}^2). \tag{8.7}$$

Notice that the mean of this distribution is the least squares estimate, and
the standard deviation is the same as the approximate formula (8.4).
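
The agreement between the parametric bootstrap and formula (8.4) can be
checked numerically. The sketch below is illustrative only and swaps in a
straight-line fit, with basis h(x) = (1, x), for the seven-function B-spline
basis so that least squares has a closed form without matrix code; for that
basis (8.4) reduces to the expression computed as `formula_se`. The data and
all function names are made up for the example.

```python
import random


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def parametric_bootstrap_se(x, y, x0, B, rng):
    """Bootstrap and formula-based standard errors of the fitted value at x0."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    # Noise estimate with divisor N, as in the text.
    sigma = (sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n) ** 0.5
    preds = []
    for _ in range(B):
        # New responses: fitted values plus Gaussian noise, as in (8.6).
        y_star = [fi + rng.gauss(0.0, sigma) for fi in fitted]
        a0, a1 = fit_line(x, y_star)
        preds.append(a0 + a1 * x0)
    mean_pred = sum(preds) / B
    boot_se = (sum((p - mean_pred) ** 2 for p in preds) / (B - 1)) ** 0.5
    # Formula (8.4) specialised to h(x) = (1, x).
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    formula_se = sigma * (1.0 / n + (x0 - xbar) ** 2 / sxx) ** 0.5
    return boot_se, formula_se


rng = random.Random(7)
x = [i / 49 for i in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
boot_se, formula_se = parametric_bootstrap_se(x, y, x0=0.9, B=2000, rng=rng)
print(boot_se, formula_se)
```

With B = 2000 the two standard errors should agree to within a few percent,
and the gap should shrink further as B grows.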
8.2.2 Maximum Likelihood Inference

It turns out that the parametric bootstrap agrees with least squares in the 
previous example because the model (8.5) has additive Gaussian errors. In 
general, the parametric bootstrap agrees not with least squares but with 
maximum likelihood, which we now review. 

We begin by specifying a probability density or probability mass function
for our observations

$$z_i \sim g_\theta(z). \tag{8.8}$$

In this expression $\theta$ represents one or more unknown parameters that
govern the distribution of $Z$. This is called a parametric model for $Z$. As
an example, if $Z$ has a normal distribution with mean $\mu$ and variance
$\sigma^2$, then

$$\theta = (\mu, \sigma^2), \tag{8.9}$$

and

$$g_\theta(z) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}(z - \mu)^2/\sigma^2}. \tag{8.10}$$

Maximum likelihood is based on the likelihood function, given by

$$L(\theta; \mathbf{Z}) = \prod_{i=1}^{N} g_\theta(z_i), \tag{8.11}$$

the probability of the observed data under the model $g_\theta$. The
likelihood is defined only up to a positive multiplier, which we have taken to
be one. We think of $L(\theta; \mathbf{Z})$ as a function of $\theta$, with our
data $\mathbf{Z}$ fixed.

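Denote the logarithm of $L(\theta; \mathbf{Z})$ by

$$\ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i), \tag{8.12}$$

which we will sometimes abbreviate as $\ell(\theta)$. This expression is
called the log-likelihood, and each value
$\ell(\theta; z_i) = \log g_\theta(z_i)$ is called a log-likelihood
component. The method of maximum likelihood chooses the value
$\theta = \hat{\theta}$ to maximize $\ell(\theta; \mathbf{Z})$.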

The likelihood function can be used to assess the precision of
$\hat{\theta}$. We need a few more definitions. The score function is defined
by

$$\dot{\ell}(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \dot{\ell}(\theta; z_i), \tag{8.13}$$

where $\dot{\ell}(\theta; z_i) = \partial\ell(\theta; z_i)/\partial\theta$.
Assuming that the likelihood takes its maximum in the interior of the
parameter space, $\dot{\ell}(\hat{\theta}; \mathbf{Z}) = 0$. The information
matrix is

$$\mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\,\partial\theta^T}. \tag{8.14}$$

When $\mathbf{I}(\theta)$ is evaluated at $\theta = \hat{\theta}$, it is often
called the observed information. The Fisher information (or expected
information) is

$$\mathbf{i}(\theta) = E_\theta[\mathbf{I}(\theta)]. \tag{8.15}$$

Finally, let $\theta_0$ denote the true value of $\theta$.

A standard result says that the sampling distribution of the maximum
likelihood estimator has a limiting normal distribution

$$\hat\theta \to N\big(\theta_0,\; i(\theta_0)^{-1}\big), \tag{8.16}$$

as $N \to \infty$. Here we are independently sampling from
$g_{\theta_0}(z)$. This suggests that the sampling distribution of
$\hat\theta$ may be approximated by

$$N\big(\hat\theta,\; i(\hat\theta)^{-1}\big) \quad\text{or}\quad N\big(\hat\theta,\; I(\hat\theta)^{-1}\big), \tag{8.17}$$

where $\hat\theta$ represents the maximum likelihood estimate from the
observed data.

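The corresponding estimates for the standard errors of $\hat\theta_j$ are
obtained from

$$\sqrt{i(\hat\theta)^{-1}_{jj}} \quad\text{and}\quad \sqrt{I(\hat\theta)^{-1}_{jj}}. \tag{8.18}$$

Confidence points for $\theta_j$ can be constructed from either
approximation in (8.17). Such a confidence point has the form

$$\hat\theta_j - z^{(1-\alpha)} \cdot \sqrt{i(\hat\theta)^{-1}_{jj}} \quad\text{or}\quad \hat\theta_j - z^{(1-\alpha)} \cdot \sqrt{I(\hat\theta)^{-1}_{jj}},$$

respectively, where $z^{(1-\alpha)}$ is the $1 - \alpha$ percentile of the
standard normal distribution. More accurate confidence intervals can be
derived from the likelihood function, by using the chi-squared approximation

$$2[\ell(\hat\theta) - \ell(\theta_0)] \sim \chi^2_p, \tag{8.19}$$

where $p$ is the number of components in $\theta$. The resulting
$1 - 2\alpha$ confidence interval is the set of all $\theta_0$ such that
$2[\ell(\hat\theta) - \ell(\theta_0)] \le {\chi^2_p}^{(1-2\alpha)}$, where
${\chi^2_p}^{(1-2\alpha)}$ is the $1 - 2\alpha$ percentile of the chi-squared
distribution with $p$ degrees of freedom.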
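As an illustration of these definitions, the sketch below computes the maximum likelihood estimate, the observed information, and the resulting standard errors for the normal model of (8.9) and (8.10). The data are simulated, and the finite-difference Hessian (`observed_information`) is a helper written for this example rather than anything prescribed by the text.

```python
import numpy as np

def log_lik(theta, z):
    """Gaussian log-likelihood l(theta; Z) with theta = (mu, sigma^2)."""
    mu, s2 = theta
    return -0.5 * len(z) * np.log(2.0 * np.pi * s2) - 0.5 * np.sum((z - mu) ** 2) / s2

def observed_information(theta, z, eps=1e-4):
    """I(theta) = minus the Hessian of the log-likelihood (central differences)."""
    theta = np.asarray(theta, dtype=float)
    p = len(theta)
    info = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej = np.zeros(p)
            ek = np.zeros(p)
            ej[j] = eps
            ek[k] = eps
            info[j, k] = -(
                log_lik(theta + ej + ek, z) - log_lik(theta + ej - ek, z)
                - log_lik(theta - ej + ek, z) + log_lik(theta - ej - ek, z)
            ) / (4.0 * eps ** 2)
    return info

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.5, size=200)     # simulated data, not from the text

theta_hat = np.array([z.mean(), z.var()])         # closed-form MLE of (mu, sigma^2)
info = observed_information(theta_hat, z)
se = np.sqrt(np.diag(np.linalg.inv(info)))        # second form in (8.18)

z_95 = 1.6448536269514722                         # z^(1-alpha) for alpha = 0.05
lower, upper = theta_hat - z_95 * se, theta_hat + z_95 * se
```

For this model the observed information at the maximum is diagonal, so the numerical standard errors should match $\sqrt{\hat\sigma^2/N}$ for $\hat\mu$ and $\hat\sigma^2\sqrt{2/N}$ for $\hat\sigma^2$.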
Let’s return to our smoothing example to see what maximum likelihood
yields. The parameters are $\theta = (\beta, \sigma^2)$. The log-likelihood is

$$\ell(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} \big(y_i - h(x_i)^T\beta\big)^2. \tag{8.20}$$

The maximum likelihood estimate is obtained by setting
$\partial\ell/\partial\beta = 0$ and $\partial\ell/\partial\sigma^2 = 0$,
giving

$$\hat\beta = (H^T H)^{-1} H^T y, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat\mu(x_i)\big)^2, \tag{8.21}$$

which are the same as the usual estimates given in (8.2) and below (8.3).

The information matrix for $\theta = (\beta, \sigma^2)$ is block-diagonal, and
the block corresponding to $\beta$ is

$$I(\beta) = (H^T H)/\sigma^2, \tag{8.22}$$

so that the estimated variance $(H^T H)^{-1}\hat\sigma^2$ agrees with the
least squares estimate (8.3).

8.2.3 Bootstrap versus Maximum Likelihood 

In essence the bootstrap is a computer implementation of nonparametric or
parametric maximum likelihood. The advantage of the bootstrap over the
maximum likelihood formula is that it allows us to compute maximum
likelihood estimates of standard errors and other quantities in settings
where no formulas are available.

In our example, suppose that we adaptively choose by cross-validation
the number and position of the knots that define the B-splines, rather
than fix them in advance. Denote by $\lambda$ the collection of knots and
their positions. Then the standard errors and confidence bands should
account for the adaptive choice of $\lambda$, but there is no way to do this
analytically. With the bootstrap, we compute the B-spline smooth with an
adaptive choice of knots for each bootstrap sample. The percentiles of the
resulting curves capture the variability from both the noise in the targets
and the choice of $\lambda$. In this particular example the confidence bands
(not shown) don’t look much different from the fixed $\lambda$ bands. But in
other problems, where more adaptation is used, this can be an important
effect to capture.
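The idea of repeating the adaptive step inside every bootstrap sample can be sketched as follows. This is an illustration under loud assumptions: the data are simulated, and the adaptive choice is the degree of a polynomial picked by leave-one-out cross-validation, standing in for the knot selection $\lambda$ of the text. The helper `fit_adaptive` is hypothetical.

```python
import numpy as np

def fit_adaptive(x, y, x_grid, degrees=(1, 2, 3, 4, 5)):
    """Pick a polynomial degree by leave-one-out CV, then predict on x_grid."""
    best_cv, best_pred = np.inf, None
    for d in degrees:
        H = np.vander(x, d + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(H, y, rcond=None)
        leverage = np.sum(H * np.linalg.pinv(H).T, axis=1)   # diagonal of the hat matrix
        loo_resid = (y - H @ coef) / (1.0 - leverage)         # LOO shortcut for linear fits
        cv = np.mean(loo_resid ** 2)
        if cv < best_cv:
            best_cv = cv
            best_pred = np.vander(x_grid, d + 1, increasing=True) @ coef
    return best_pred

rng = np.random.default_rng(2)
N = 50
x = np.sort(rng.uniform(0.0, 3.0, N))
y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(N)
x_grid = np.linspace(0.0, 3.0, 25)

B = 200
curves = np.empty((B, x_grid.size))
for b in range(B):
    idx = rng.integers(0, N, size=N)              # resample (x_i, y_i) pairs
    curves[b] = fit_adaptive(x[idx], y[idx], x_grid)   # adaptive choice redone each time

lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```

Because the selection step is rerun on each resample, the percentile bands `lower` and `upper` reflect variability from the selection as well as from the noise. Note that duplicated points in a bootstrap sample make the cross-validation score only approximate.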


8.3 Bayesian Methods 

In the Bayesian approach to inference, we specify a sampling model
$\Pr(Z|\theta)$ (density or probability mass function) for our data given
the parameters, and a prior distribution for the parameters $\Pr(\theta)$
reflecting our knowledge about $\theta$ before we see the data. We then
compute the posterior distribution

$$\Pr(\theta|Z) = \frac{\Pr(Z|\theta)\cdot\Pr(\theta)}{\int \Pr(Z|\theta)\cdot\Pr(\theta)\,d\theta}, \tag{8.23}$$

which represents our updated knowledge about $\theta$ after we see the data.
To understand this posterior distribution, one might draw samples from it or
summarize by computing its mean or mode. The Bayesian approach differs
from the standard (“frequentist”) method for inference in its use of a prior
distribution to express the uncertainty present before seeing the data, and
to allow the uncertainty remaining after seeing the data to be expressed in
the form of a posterior distribution.

The posterior distribution also provides the basis for predicting the values
of a future observation $z^{\text{new}}$, via the predictive distribution:

$$\Pr(z^{\text{new}}|Z) = \int \Pr(z^{\text{new}}|\theta)\cdot\Pr(\theta|Z)\,d\theta. \tag{8.24}$$

In contrast, the maximum likelihood approach would use
$\Pr(z^{\text{new}}|\hat\theta)$, the data density evaluated at the maximum
likelihood estimate, to predict future data. Unlike the predictive
distribution (8.24), this does not account for the uncertainty in estimating
$\theta$.

Let’s walk through the Bayesian approach in our smoothing example.
We start with the parametric model given by equation (8.5), and assume
for the moment that $\sigma^2$ is known. We assume that the observed feature
values $x_1, x_2, \ldots, x_N$ are fixed, so that the randomness in the data
comes solely from $y$ varying around its mean $\mu(x)$.

The second ingredient we need is a prior distribution. Distributions on
functions are fairly complex entities: one approach is to use a Gaussian
process prior in which we specify the prior covariance between any two
function values $\mu(x)$ and $\mu(x')$ (Wahba, 1990; Neal, 1996).

Here we take a simpler route: by considering a finite B-spline basis for
$\mu(x)$, we can instead provide a prior for the coefficients $\beta$, and
this implicitly defines a prior for $\mu(x)$. We choose a Gaussian prior
centered at zero

$$\beta \sim N(0, \tau\Sigma) \tag{8.25}$$

with the choices of the prior correlation matrix $\Sigma$ and variance $\tau$
to be discussed below. The implicit process prior for $\mu(x)$ is hence
Gaussian, with covariance kernel

$$K(x, x') = \operatorname{cov}[\mu(x), \mu(x')] = \tau \cdot h(x)^T \Sigma\, h(x'). \tag{8.26}$$



FIGURE 8.3. Smoothing example: Ten draws from the Gaussian prior
distribution for the function $\mu(x)$.

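The posterior distribution for $\beta$ is also Gaussian, with mean and
covariance

$$E(\beta|Z) = \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} H^T y, \qquad
\operatorname{cov}(\beta|Z) = \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1}\sigma^2, \tag{8.27}$$

with the corresponding posterior values for $\mu(x)$,

$$E(\mu(x)|Z) = h(x)^T\Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} H^T y, \qquad
\operatorname{cov}[\mu(x), \mu(x')|Z] = h(x)^T\Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} h(x')\,\sigma^2. \tag{8.28}$$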
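The posterior for $\beta$ under the Gaussian prior (8.25) can be computed directly. The sketch below assumes simulated data, a known noise variance, and a cubic polynomial basis in place of the B-spline basis; the function name `beta_posterior` is introduced here for illustration. It uses the standard conjugate Gaussian result, with posterior precision $H^T H/\sigma^2 + \Sigma^{-1}/\tau$.

```python
import numpy as np

def beta_posterior(H, y, sigma2, tau, Sigma=None):
    """Posterior mean and covariance of beta for y = H beta + eps,
    eps ~ N(0, sigma2 I), under the prior beta ~ N(0, tau * Sigma)."""
    p = H.shape[1]
    Sigma = np.eye(p) if Sigma is None else Sigma
    A = H.T @ H + (sigma2 / tau) * np.linalg.inv(Sigma)
    A_inv = np.linalg.inv(A)
    cov = A_inv * sigma2
    return A_inv @ H.T @ y, 0.5 * (cov + cov.T)   # symmetrize against round-off

rng = np.random.default_rng(3)
N = 50
x = np.linspace(0.0, 3.0, N)
H = np.vander(x, 4, increasing=True)     # hypothetical basis, stands in for h(x)
sigma2 = 0.09                             # treated as known, as in the text
y = np.sin(2.0 * x) + np.sqrt(sigma2) * rng.standard_normal(N)

beta_ls = np.linalg.solve(H.T @ H, H.T @ y)
mean_tight, cov_tight = beta_posterior(H, y, sigma2, tau=1.0)
mean_flat, cov_flat = beta_posterior(H, y, sigma2, tau=1e8)

# Posterior curves mu'(x) = sum_j beta'_j h_j(x), one per row.
draws = rng.multivariate_normal(mean_tight, cov_tight, size=10)
curves = draws @ H.T
```

With a very large `tau` the posterior mean approaches the least squares coefficients, while `tau=1.0` shrinks them toward zero.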





How do we choose the prior correlation matrix $\Sigma$? In some settings the
prior can be chosen from subject matter knowledge about the parameters.
Here we are willing to say the function $\mu(x)$ should be smooth, and have
guaranteed this by expressing $\mu$ in a smooth low-dimensional basis of
B-splines. Hence we can take the prior correlation matrix to be the identity
$\Sigma = I$. When the number of basis functions is large, this might not be
sufficient, and additional smoothness can be enforced by imposing
restrictions on $\Sigma$; this is exactly the case with smoothing splines
(Section 5.8.1).

Figure 8.3 shows ten draws from the corresponding prior for $\mu(x)$. To
generate posterior values of the function $\mu(x)$, we generate values
$\beta'$ from its posterior (8.27), giving corresponding posterior value
$\mu'(x) = \sum_j \beta'_j h_j(x)$. Ten such posterior curves are shown in
Figure 8.4. Two different values were used for the prior variance $\tau$, 1
and 1000. Notice how similar the right panel looks to the bootstrap
distribution in the bottom left panel of Figure 8.2. This similarity is no
accident. As $\tau \to \infty$, the posterior distribution (8.27) and the
bootstrap distribution (8.7) coincide. On the other hand, for $\tau = 1$, the
posterior curves $\mu(x)$ in the left panel of Figure 8.4 are smoother than
the bootstrap curves, because we have imposed more prior weight on
smoothness.

FIGURE 8.4. Smoothing example: Ten draws from the posterior distribution
for the function $\mu(x)$, for two different values of the prior variance
$\tau$ (left panel: $\tau = 1$; right panel: $\tau = 1000$). The purple
curves are the posterior means.

The distribution (8.25) with $\tau \to \infty$ is called a noninformative
prior for $\theta$. In Gaussian models, maximum likelihood and parametric
bootstrap analyses tend to agree with Bayesian analyses that use a
noninformative prior for the free parameters. These tend to agree, because
with a constant prior, the posterior distribution is proportional to the
likelihood. This correspondence also extends to the nonparametric case,
where the nonparametric bootstrap approximates a noninformative Bayes
analysis; Section 8.4 has the details.

We have, however, done some things that are not proper from a Bayesian
point of view. We have used a noninformative (constant) prior for $\sigma^2$
and replaced it with the maximum likelihood estimate $\hat\sigma^2$ in the
posterior. A more standard Bayesian analysis would also put a prior on
$\sigma$ (typically $g(\sigma) \propto 1/\sigma$), calculate a joint posterior
for $\mu(x)$ and $\sigma$, and then integrate out $\sigma$, rather than just
extract the maximum of the posterior distribution (“MAP” estimate).
8.4 Relationship Between the Bootstrap and Bayesian Inference

Consider first a very simple example, in which we observe a single
observation $z$ from a normal distribution

$$z \sim N(\theta, 1). \tag{8.29}$$

To carry out a Bayesian analysis for $\theta$, we need to specify a prior.
The most convenient and common choice would be $\theta \sim N(0, \tau)$,
giving posterior distribution

$$\theta|z \sim N\Big(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\Big). \tag{8.30}$$

Now the larger we take $\tau$, the more concentrated the posterior becomes
around the maximum likelihood estimate $\hat\theta = z$. In the limit as
$\tau \to \infty$ we obtain a noninformative (constant) prior, and the
posterior distribution is

$$\theta|z \sim N(z, 1). \tag{8.31}$$

This is the same as a parametric bootstrap distribution in which we
generate bootstrap values $z^*$ from the maximum likelihood estimate of the
sampling density $N(z, 1)$.
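A small numerical check of this limit is sketched below. The observation `z = 1.7` is a made-up value, and `posterior_params` is a helper defined only for this illustration; it encodes the conjugate normal update for a $N(0, \tau)$ prior and unit-variance likelihood.

```python
import numpy as np

def posterior_params(z, tau):
    """Mean and variance of theta | z when z ~ N(theta, 1) and theta ~ N(0, tau)."""
    shrink = 1.0 / (1.0 + 1.0 / tau)
    return shrink * z, shrink

z = 1.7                                      # a single hypothetical observation
for tau in (1.0, 10.0, 1000.0, 1e6):
    mean, var = posterior_params(z, tau)
    print(f"tau={tau:>9g}  posterior mean={mean:.4f}  variance={var:.4f}")

# Parametric bootstrap: z* ~ N(theta_hat, 1) with theta_hat = z.
rng = np.random.default_rng(4)
z_star = rng.normal(loc=z, scale=1.0, size=100_000)
print(f"bootstrap mean={z_star.mean():.4f}  variance={z_star.var():.4f}")
```

As `tau` grows, the posterior mean and variance approach `z` and 1, the same center and spread as the bootstrap values `z_star`.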

There are three ingredients that make this correspondence work:

1. The choice of noninformative prior for $\theta$.

2. The dependence of the log-likelihood $\ell(\theta; Z)$ on the data $Z$
only through the maximum likelihood estimate $\hat\theta$. Hence we can
write the log-likelihood as $\ell(\theta; \hat\theta)$.

3. The symmetry of the log-likelihood in $\theta$ and $\hat\theta$, that is,
$\ell(\theta; \hat\theta) = \ell(\hat\theta; \theta) + \text{constant}$.

Properties (2) and (3) essentially only hold for the Gaussian distribution.
However, they also hold approximately for the multinomial distribution,
leading to a correspondence between the nonparametric bootstrap
and Bayes inference, which we outline next.

Assume that we have a discrete sample space with $L$ categories. Let $w_j$
be the probability that a sample point falls in category $j$, and $\hat w_j$
the observed proportion in category $j$. Let $w = (w_1, w_2, \ldots, w_L)$,
$\hat w = (\hat w_1, \hat w_2, \ldots, \hat w_L)$. Denote our estimator by
$S(\hat w)$; take as a prior distribution for $w$ a symmetric Dirichlet
distribution with parameter $a$:

$$w \sim \mathrm{Di}_L(a1), \tag{8.32}$$

that is, the prior density is proportional to
$\prod_{\ell=1}^{L} w_\ell^{a-1}$. Then the posterior density of $w$ is

$$w \sim \mathrm{Di}_L(a1 + N\hat w), \tag{8.33}$$

where $N$ is the sample size. Letting $a \to 0$ to obtain a noninformative
prior gives

$$w \sim \mathrm{Di}_L(N\hat w). \tag{8.34}$$

Now the bootstrap distribution, obtained by sampling with replacement
from the data, can be expressed as sampling the category proportions from
a multinomial distribution. Specifically,

$$N\hat w^* \sim \mathrm{Mult}(N, \hat w), \tag{8.35}$$

where $\mathrm{Mult}(N, \hat w)$ denotes a multinomial distribution, having
probability mass function
$\binom{N}{N\hat w^*_1, \ldots, N\hat w^*_L} \prod_{\ell} \hat w_\ell^{N\hat w^*_\ell}$.
This distribution is similar to the posterior distribution above, having the
same support, same mean, and nearly the same covariance matrix. Hence the
bootstrap distribution of $S(\hat w^*)$ will closely approximate the
posterior distribution of $S(w)$.
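The closeness of the two distributions can be seen by simulation. In the sketch below the proportions `w_hat`, the sample size, and the estimator `S` (a mean category index) are invented for illustration. The covariance formulas are the standard ones for multinomial proportions and for a Dirichlet distribution with parameter $N\hat w$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical observed proportions over L = 4 categories from N = 40 points.
N = 40
w_hat = np.array([0.1, 0.2, 0.3, 0.4])

# Bootstrap: N * w_hat_star ~ Mult(N, w_hat), as in (8.35).
w_boot = rng.multinomial(N, w_hat, size=50_000) / N

# Noninformative posterior: w ~ Dirichlet(N * w_hat), as in (8.34).
w_post = rng.dirichlet(N * w_hat, size=50_000)

def exact_cov(w, denom):
    """(diag(w) - w w^T) / denom: denom = N for the multinomial proportions,
    N + 1 for the Dirichlet(N w) distribution."""
    return (np.diag(w) - np.outer(w, w)) / denom

cov_boot, cov_post = exact_cov(w_hat, N), exact_cov(w_hat, N + 1)

def S(w):
    """An estimator chosen only for illustration: the mean category index."""
    return w @ np.arange(1, 5)

sd_boot, sd_post = np.std(S(w_boot)), np.std(S(w_post))
```

The two covariance matrices differ only by the factor $N/(N+1)$, so `sd_boot` and `sd_post` should be nearly equal for moderate $N$.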

In this sense, the bootstrap distribution represents an (approximate) 
nonparametric, noninformative posterior distribution for our parameter. 
But this bootstrap distribution is obtained painlessly—without having to 
formally specify a prior and without having to sample from the posterior 
distribution. Hence we might think of the bootstrap distribution as a “poor 
man’s” Bayes posterior. By perturbing the data, the bootstrap approximates
the Bayesian effect of perturbing the parameters, and is typically
much simpler to carry out. 


8.5 The EM Algorithm 

The EM algorithm is a popular tool for simplifying difficult maximum 
likelihood problems. We first describe it in the context of a simple mixture 
model. 


8.5.1 Two-Component Mixture Model 

In this section we describe a simple mixture model for density estimation, 
and the associated EM algorithm for carrying out maximum likelihood 
estimation. This has a natural connection to Gibbs sampling methods for 
Bayesian inference. Mixture models are discussed and demonstrated in several
other parts of the book, in particular Sections 6.8, 12.7 and 13.2.3.

The left panel of Figure 8.5 shows a histogram of the 20 fictitious data 
points in Table 8.1. 
FIGURE 8.5. Mixture example. (Left panel:) Histogram of data. (Right panel:)
Maximum likelihood fit of Gaussian densities (solid red) and responsibility (dotted 
green) of the left component density for observation y, as a function of y. 


TABLE 8.1. Twenty fictitious data points used in the two-component mixture 
example in Figure 8.5. 


-0.39 0.12 0.94 1.67 1.76 2.44 3.72 4.28 4.92 5.53 

0.06 0.48 1.01 1.68 1.80 3.25 4.12 4.60 5.28 6.22 


We would like to model the density of the data points, and due to the
apparent bi-modality, a Gaussian distribution would not be appropriate.
There seem to be two separate underlying regimes, so instead we model
$Y$ as a mixture of two normal distributions:

$$\begin{aligned}
Y_1 &\sim N(\mu_1, \sigma_1^2), \\
Y_2 &\sim N(\mu_2, \sigma_2^2), \\
Y &= (1 - \Delta)\cdot Y_1 + \Delta\cdot Y_2,
\end{aligned} \tag{8.36}$$

where $\Delta \in \{0, 1\}$ with $\Pr(\Delta = 1) = \pi$. This generative
representation is explicit: generate a $\Delta \in \{0, 1\}$ with
probability $\pi$, and then depending on the outcome, deliver either $Y_1$
or $Y_2$. Let $\phi_\theta(x)$ denote the normal density with parameters
$\theta = (\mu, \sigma^2)$. Then the density of $Y$ is

$$g_Y(y) = (1 - \pi)\phi_{\theta_1}(y) + \pi\phi_{\theta_2}(y). \tag{8.37}$$

Now suppose we wish to fit this model to the data in Figure 8.5 by maximum
likelihood. The parameters are

$$\theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2). \tag{8.38}$$

The log-likelihood based on the $N$ training cases is

$$\ell(\theta; Z) = \sum_{i=1}^{N} \log\big[(1 - \pi)\phi_{\theta_1}(y_i) + \pi\phi_{\theta_2}(y_i)\big]. \tag{8.39}$$
Direct maximization of $\ell(\theta; Z)$ is quite difficult numerically,
because of the sum of terms inside the logarithm. There is, however, a
simpler approach. We consider unobserved latent variables $\Delta_i$ taking
values 0 or 1 as in (8.36): if $\Delta_i = 1$ then $Y_i$ comes from model 2,
otherwise it comes from model 1. Suppose we knew the values of the
$\Delta_i$’s. Then the log-likelihood would be

$$\begin{aligned}
\ell_0(\theta; Z, \Delta) = {} & \sum_{i=1}^{N} \big[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\big] \\
& + \sum_{i=1}^{N} \big[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\big],
\end{aligned} \tag{8.40}$$

and the maximum likelihood estimates of $\mu_1$ and $\sigma_1^2$ would be the
sample mean and variance for those data with $\Delta_i = 0$, and similarly
those for $\mu_2$ and $\sigma_2^2$ would be the sample mean and variance of
the data with $\Delta_i = 1$. The estimate of $\pi$ would be the proportion
of $\Delta_i = 1$.

Since the values of the $\Delta_i$’s are actually unknown, we proceed in an
iterative fashion, substituting for each $\Delta_i$ in (8.40) its expected
value

$$\gamma_i(\theta) = E(\Delta_i|\theta, Z) = \Pr(\Delta_i = 1|\theta, Z), \tag{8.41}$$

also called the responsibility of model 2 for observation $i$. We use a
procedure called the EM algorithm, given in Algorithm 8.1 for the special
case of Gaussian mixtures. In the expectation step, we do a soft assignment
of each observation to each model: the current estimates of the parameters
are used to assign responsibilities according to the relative density of the
training points under each model. In the maximization step, these
responsibilities are used in weighted maximum-likelihood fits to update the
estimates of the parameters.

A good way to construct initial guesses for $\hat\mu_1$ and $\hat\mu_2$ is
simply to choose two of the $y_i$ at random. Both $\hat\sigma_1^2$ and
$\hat\sigma_2^2$ can be set equal to the overall sample variance
$\sum_{i=1}^{N}(y_i - \bar y)^2/N$. The mixing proportion $\hat\pi$ can be
started at the value 0.5.

Note that the actual maximizer of the likelihood occurs when we put a
spike of infinite height at any one data point, that is, $\hat\mu_1 = y_i$
for some $i$ and $\hat\sigma_1^2 = 0$. This gives infinite likelihood, but is
not a useful solution. Hence we are actually looking for a good local
maximum of the likelihood, one for which
$\hat\sigma_1^2, \hat\sigma_2^2 > 0$. To further complicate matters, there
can be more than one local maximum having
$\hat\sigma_1^2, \hat\sigma_2^2 > 0$. In our example, we ran the EM algorithm
with a number of different initial guesses for the parameters, all having
$\hat\sigma_k^2 > 0.5$, and chose the run that gave us the highest maximized
likelihood. Figure 8.6 shows the progress of the EM algorithm in maximizing
the log-likelihood. Table 8.2 shows $\hat\pi = \sum_i \hat\gamma_i/N$, the
maximum likelihood estimate of the proportion of observations in class 2, at
selected iterations of the EM procedure.
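Algorithm 8.1 EM Algorithm for Two-component Gaussian Mixture.

1. Take initial guesses for the parameters $\hat\mu_1, \hat\sigma_1^2,
   \hat\mu_2, \hat\sigma_2^2, \hat\pi$ (see text).

2. Expectation Step: compute the responsibilities

$$\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \quad i = 1, 2, \ldots, N. \tag{8.42}$$

3. Maximization Step: compute the weighted means and variances:

$$\hat\mu_1 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat\gamma_i)}, \qquad
\hat\sigma_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N}(1 - \hat\gamma_i)},$$

$$\hat\mu_2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,y_i}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad
\hat\sigma_2^2 = \frac{\sum_{i=1}^{N}\hat\gamma_i(y_i - \hat\mu_2)^2}{\sum_{i=1}^{N}\hat\gamma_i},$$

   and the mixing probability $\hat\pi = \sum_{i=1}^{N}\hat\gamma_i/N$.

4. Iterate steps 2 and 3 until convergence.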
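Algorithm 8.1 translates almost line for line into code. The sketch below runs it on the data of Table 8.1. The starting means (the largest and smallest observations) and the fixed iteration count are choices made for this illustration, and the function names are not from the text.

```python
import numpy as np

# The twenty data points of Table 8.1.
y = np.array([-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72, 4.28, 4.92, 5.53,
               0.06, 0.48, 1.01, 1.68, 1.80, 3.25, 4.12, 4.60, 5.28, 6.22])

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_likelihood(y, mu1, var1, mu2, var2, pi):
    """Observed-data log-likelihood (8.39)."""
    return np.sum(np.log((1 - pi) * normal_pdf(y, mu1, var1)
                         + pi * normal_pdf(y, mu2, var2)))

def em_two_gaussians(y, mu1, mu2, n_iter=50):
    var1 = var2 = y.var()          # overall sample variance, as suggested in the text
    pi = 0.5
    history = []
    for _ in range(n_iter):
        # E step (8.42): responsibility of component 2 for each observation.
        p1 = (1 - pi) * normal_pdf(y, mu1, var1)
        p2 = pi * normal_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        # M step: weighted means, variances, and mixing proportion.
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
        history.append(log_likelihood(y, mu1, var1, mu2, var2, pi))
    return (mu1, var1, mu2, var2, pi), np.array(history)

# Starting means are an arbitrary (hypothetical) choice of two data points.
params, history = em_two_gaussians(y, mu1=y.max(), mu2=y.min())
```

The recorded `history` should be non-decreasing. Which local maximum a given start reaches depends on the initialization, so in practice one would try several starts and keep the run with the highest log-likelihood, as the text describes.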


TABLE 8.2. Selected iterations of the EM algorithm for mixture example.

    Iteration    $\hat\pi$
        1          0.485
        5          0.493
       10          0.523
       15          0.544
       20          0.546

The final maximum likelihood estimates are

$$\hat\mu_1 = 4.62, \qquad \hat\sigma_1^2 = 0.87,$$
$$\hat\mu_2 = 1.06, \qquad \hat\sigma_2^2 = 0.77,$$
$$\hat\pi = 0.546.$$

The right panel of Figure 8.5 shows the estimated Gaussian mixture density 
from this procedure (solid red curve), along with the responsibilities (dotted 
green curve). Note that mixtures are also useful for supervised learning; in 
Section 6.7 we show how the Gaussian mixture model leads to a version of 
radial basis functions. 
FIGURE 8.6. EM algorithm: observed data log-likelihood as a function of the
iteration number. 

8.5.2 The EM Algorithm in General 

The above procedure is an example of the EM (or Baum-Welch) algorithm
for maximizing likelihoods in certain classes of problems. These problems
are ones for which maximization of the likelihood is difficult, but made
easier by enlarging the sample with latent (unobserved) data. This is called
data augmentation. Here the latent data are the model memberships
$\Delta_i$. In other problems, the latent data are actual data that should
have been observed but are missing.

Algorithm 8.2 gives the general formulation of the EM algorithm. Our
observed data is $Z$, having log-likelihood $\ell(\theta; Z)$ depending on
parameters $\theta$. The latent or missing data is $Z^m$, so that the
complete data is $T = (Z, Z^m)$ with log-likelihood $\ell_0(\theta; T)$,
$\ell_0$ based on the complete density. In the mixture problem
$(Z, Z^m) = (y, \Delta)$, and $\ell_0(\theta; T)$ is given in (8.40).

In our mixture example, $E(\ell_0(\theta'; T)|Z, \hat\theta^{(j)})$ is simply
(8.40) with the $\Delta_i$ replaced by the responsibilities
$\hat\gamma_i(\hat\theta)$, and the maximizers in step 3 are just weighted
means and variances.

Algorithm 8.2 The EM Algorithm.

1. Start with initial guesses for the parameters $\hat\theta^{(0)}$.

2. Expectation Step: at the $j$th step, compute

$$Q(\theta', \hat\theta^{(j)}) = E(\ell_0(\theta'; T)|Z, \hat\theta^{(j)}) \tag{8.43}$$

   as a function of the dummy argument $\theta'$.

3. Maximization Step: determine the new estimate $\hat\theta^{(j+1)}$ as the
   maximizer of $Q(\theta', \hat\theta^{(j)})$ over $\theta'$.

4. Iterate steps 2 and 3 until convergence.

We now give an explanation of why the EM algorithm works in general.
Since

$$\Pr(Z^m|Z, \theta') = \frac{\Pr(Z^m, Z|\theta')}{\Pr(Z|\theta')}, \tag{8.44}$$

we can write

$$\Pr(Z|\theta') = \frac{\Pr(T|\theta')}{\Pr(Z^m|Z, \theta')}. \tag{8.45}$$

In terms of log-likelihoods, we have
$\ell(\theta'; Z) = \ell_0(\theta'; T) - \ell_1(\theta'; Z^m|Z)$, where
$\ell_1$ is based on the conditional density $\Pr(Z^m|Z, \theta')$. Taking
conditional expectations with respect to the distribution of $T|Z$ governed
by parameter $\theta$ gives

$$\begin{aligned}
\ell(\theta'; Z) &= E[\ell_0(\theta'; T)|Z, \theta] - E[\ell_1(\theta'; Z^m|Z)|Z, \theta] \\
&\equiv Q(\theta', \theta) - R(\theta', \theta).
\end{aligned} \tag{8.46}$$

In the M step, the EM algorithm maximizes $Q(\theta', \theta)$ over
$\theta'$, rather than the actual objective function $\ell(\theta'; Z)$. Why
does it succeed in maximizing $\ell(\theta'; Z)$? Note that
$R(\theta^*, \theta)$ is the expectation of a log-likelihood of a density
(indexed by $\theta^*$), with respect to the same density indexed by
$\theta$, and hence (by Jensen’s inequality) is maximized as a function of
$\theta^*$, when $\theta^* = \theta$ (see Exercise 8.1). So if $\theta'$
maximizes $Q(\theta', \theta)$, we see that

$$\begin{aligned}
\ell(\theta'; Z) - \ell(\theta; Z) &= [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)] \\
&\ge 0.
\end{aligned} \tag{8.47}$$

Hence the EM iteration never decreases the log-likelihood.

This argument also makes it clear that a full maximization in the M
step is not necessary: we need only to find a value $\hat\theta^{(j+1)}$ so
that $Q(\theta', \hat\theta^{(j)})$ increases as a function of the first
argument, that is,
$Q(\hat\theta^{(j+1)}, \hat\theta^{(j)}) > Q(\hat\theta^{(j)}, \hat\theta^{(j)})$.
Such procedures are called GEM (generalized EM) algorithms. The EM algorithm
can also be viewed as a minorization procedure: see Exercise 8.7.
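The Jensen step invoked in this argument can be written out explicitly. The following is a sketch of the standard reasoning (the subject of Exercise 8.1), using the notation of (8.46), with the integral taken over the values $z^m$ for which $\Pr(z^m \mid Z, \theta) > 0$:

```latex
\begin{aligned}
R(\theta^*, \theta) - R(\theta, \theta)
  &= E\!\left[\log \frac{\Pr(Z^m \mid Z, \theta^*)}{\Pr(Z^m \mid Z, \theta)} \,\Big|\, Z, \theta\right] \\
  &\le \log E\!\left[\frac{\Pr(Z^m \mid Z, \theta^*)}{\Pr(Z^m \mid Z, \theta)} \,\Big|\, Z, \theta\right]
     \quad \text{(Jensen, since $\log$ is concave)} \\
  &= \log \int \Pr(z^m \mid Z, \theta^*)\, dz^m \;\le\; \log 1 = 0 .
\end{aligned}
```

So $R(\theta^*, \theta) \le R(\theta, \theta)$ for every $\theta^*$, which is what makes the second bracket in (8.47) nonpositive.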


8.5.3 EM as a Maximization-Maximization Procedure 

Here is a different view of the EM procedure, as a joint maximization
algorithm. Consider the function

$$F(\theta', \tilde P) = E_{\tilde P}[\ell_0(\theta'; T)] - E_{\tilde P}[\log \tilde P(Z^m)]. \tag{8.48}$$

Here $\tilde P(Z^m)$ is any distribution over the latent data $Z^m$. In the
mixture example, $\tilde P(Z^m)$ comprises the set of probabilities
$\gamma_i = \Pr(\Delta_i = 1|\theta, Z)$. Note that $F$ evaluated at
$\tilde P(Z^m) = \Pr(Z^m|Z, \theta')$, is the log-likelihood of the observed
data, from (8.46).¹ The function $F$ expands the domain of the
log-likelihood, to facilitate its maximization.

FIGURE 8.7. Maximization-maximization view of the EM algorithm. Shown
are the contours of the (augmented) observed data log-likelihood
$F(\theta', \tilde P)$. The E step is equivalent to maximizing the
log-likelihood over the parameters of the latent data distribution. The M
step maximizes it over the parameters of the log-likelihood. The red curve
corresponds to the observed data log-likelihood, a profile obtained by
maximizing $F(\theta', \tilde P)$ for each value of $\theta'$. (Axis label:
Latent Data Parameters.)

The EM algorithm can be viewed as a joint maximization method for $F$
over $\theta'$ and $\tilde P(Z^m)$, by fixing one argument and maximizing
over the other. The maximizer over $\tilde P(Z^m)$ for fixed $\theta'$ can be
shown to be

$$\tilde P(Z^m) = \Pr(Z^m|Z, \theta') \tag{8.49}$$

(Exercise 8.2). This is the distribution computed by the E step, for example,
(8.42) in the mixture example. In the M step, we maximize
$F(\theta', \tilde P)$ over $\theta'$ with $\tilde P$ fixed: this is the same
as maximizing the first term $E_{\tilde P}[\ell_0(\theta'; T)|Z, \theta]$
since the second term does not involve $\theta'$.

Finally, since $F(\theta', \tilde P)$ and the observed data log-likelihood
agree when $\tilde P(Z^m) = \Pr(Z^m|Z, \theta')$, maximization of the former
accomplishes maximization of the latter. Figure 8.7 shows a schematic view
of this process.
This view of the EM algorithm leads to alternative maximization
procedures. For example, one does not need to maximize with respect to all of
the latent data parameters at once, but could instead maximize over one
of them at a time, alternating with the M step.

¹ (8.46) holds for all $\theta$, including $\theta = \theta'$.

Algorithm 8.3 Gibbs Sampler.

1. Take some initial values $U_k^{(0)}$, $k = 1, 2, \ldots, K$.

2. Repeat for $t = 1, 2, \ldots$:

   For $k = 1, 2, \ldots, K$ generate $U_k^{(t)}$ from
   $\Pr(U_k^{(t)} \mid U_1^{(t)}, \ldots, U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, \ldots, U_K^{(t-1)})$.

3. Continue step 2 until the joint distribution of
   $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ does not change.


8.6 MCMC for Sampling from the Posterior 

Having defined a Bayesian model, one would like to draw samples from
the resulting posterior distribution, in order to make inferences about the
parameters. Except for simple models, this is often a difficult
computational problem. In this section we discuss the Markov chain Monte
Carlo (MCMC) approach to posterior sampling. We will see that Gibbs
sampling, an MCMC procedure, is closely related to the EM algorithm: the main
difference is that it samples from the conditional distributions rather than
maximizing over them.

Consider first the following abstract problem. We have random variables
$U_1, U_2, \ldots, U_K$ and we wish to draw a sample from their joint
distribution. Suppose this is difficult to do, but it is easy to simulate
from the conditional distributions
$\Pr(U_j \mid U_1, U_2, \ldots, U_{j-1}, U_{j+1}, \ldots, U_K)$,
$j = 1, 2, \ldots, K$. The Gibbs sampling procedure alternately simulates
from each of these distributions and when the process stabilizes, provides a
sample from the desired joint distribution. The procedure is defined in
Algorithm 8.3.

Under regularity conditions it can be shown that this procedure eventually
stabilizes, and the resulting random variables are indeed a sample
from the joint distribution of $U_1, U_2, \ldots, U_K$. This occurs despite
the fact that the samples $(U_1^{(t)}, U_2^{(t)}, \ldots, U_K^{(t)})$ are
clearly not independent for different $t$. More formally, Gibbs sampling
produces a Markov chain whose stationary distribution is the true joint
distribution, and hence the term “Markov chain Monte Carlo.” It is not
surprising that the true joint distribution is stationary under this
process, as the successive steps leave the marginal distributions of the
$U_k$’s unchanged.
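The Gibbs sampler of Algorithm 8.3 can be sketched for a case where the answer is known. The target here, a standard bivariate normal with correlation `rho`, is an assumption chosen for illustration and does not appear in the text; its full conditionals are $U_1 \mid U_2 \sim N(\rho U_2, 1 - \rho^2)$ and symmetrically for $U_2$.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for (U1, U2) standard bivariate normal with correlation rho.

    Each full conditional is U_k | U_other ~ N(rho * U_other, 1 - rho^2).
    """
    u1, u2 = start
    cond_sd = np.sqrt(1.0 - rho ** 2)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        u1 = rng.normal(rho * u2, cond_sd)   # uses U2 from iteration t - 1
        u2 = rng.normal(rho * u1, cond_sd)   # uses U1 from iteration t
        chain[t] = u1, u2
    return chain

rng = np.random.default_rng(6)
chain = gibbs_bivariate_normal(rho=0.8, n_iter=20_000, rng=rng, start=(5.0, -5.0))
sample = chain[1_000:]                        # discard a burn-in period
print(sample.mean(axis=0), np.corrcoef(sample.T)[0, 1])
```

Successive rows of `chain` are correlated, yet after the burn-in the sample mean and correlation should be close to those of the target distribution.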
Note that we don’t need to know the explicit form of the conditional
densities, but just need to be able to sample from them. After the procedure
reaches stationarity, the marginal density of any subset of the variables
can be approximated by a density estimate applied to the sample values.
However if the explicit form of the conditional density
$\Pr(U_k \mid U_\ell, \ell \ne k)$ is available, a better estimate of say the
marginal density of $U_k$ can be obtained from (Exercise 8.3):

$$\widehat{\Pr}_{U_k}(u) = \frac{1}{M - m + 1}\sum_{t=m}^{M} \Pr\big(u \mid U_\ell^{(t)}, \ell \ne k\big). \tag{8.50}$$

Here we have averaged over the last $M - m + 1$ members of the sequence,
to allow for an initial “burn-in” period before stationarity is reached.

Now getting back to Bayesian inference, our goal is to draw a sample from
the joint posterior of the parameters given the data $Z$. Gibbs sampling will
be helpful if it is easy to sample from the conditional distribution of each
parameter given the other parameters and $Z$. An example, the Gaussian
mixture problem, is detailed next.

There is a close connection between Gibbs sampling from a posterior and
the EM algorithm in exponential family models. The key is to consider the
latent data $Z^m$ from the EM procedure to be another parameter for the
Gibbs sampler. To make this explicit for the Gaussian mixture problem,
we take our parameters to be $(\theta, Z^m)$. For simplicity we fix the
variances $\sigma_1^2, \sigma_2^2$ and mixing proportion $\pi$ at their
maximum likelihood values so that the only unknown parameters in $\theta$ are
the means $\mu_1$ and $\mu_2$. The Gibbs sampler for the mixture problem is
given in Algorithm 8.4. We see that steps 2(a) and 2(b) are the same as the E
and M steps of the EM procedure, except that we sample rather than maximize.
In step 2(a), rather than compute the maximum likelihood responsibilities
$\gamma_i = E(\Delta_i|\theta, Z)$, the Gibbs sampling procedure simulates
the latent data $\Delta_i$ from the distributions $\Pr(\Delta_i|\theta, Z)$.
In step 2(b), rather than compute the maximizers of the posterior
$\Pr(\mu_1, \mu_2, \Delta|Z)$ we simulate from the conditional distribution
$\Pr(\mu_1, \mu_2|\Delta, Z)$.


Figure 8.8 shows 200 iterations of Gibbs sampling, with the mean parameters
$\mu_1$ (lower) and $\mu_2$ (upper) shown in the left panel, and the
proportion of class 2 observations $\sum_i \Delta_i/N$ on the right.
Horizontal broken lines have been drawn at the maximum likelihood estimate
values $\hat\mu_1$, $\hat\mu_2$ and $\sum_i \hat\gamma_i/N$ in each case. The
values seem to stabilize quite quickly, and are distributed evenly around the
maximum likelihood values.

The above mixture model was simplified, in order to make clear the
connection between Gibbs sampling and the EM algorithm. More realistically,
one would put a prior distribution on the variances
$\sigma_1^2, \sigma_2^2$ and mixing proportion $\pi$, and include separate
Gibbs sampling steps in which we sample from their posterior distributions,
conditional on the other parameters. One can also incorporate proper
(informative) priors for the mean parameters. These priors must not be
improper as this will lead to a degenerate posterior, with all the mixing
weight on one component.

Algorithm 8.4 Gibbs sampling for mixtures.

1. Take some initial values $\theta^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)})$.

2. Repeat for $t = 1, 2, \ldots$:

   (a) For $i = 1, 2, \ldots, N$ generate $\Delta_i^{(t)} \in \{0, 1\}$ with
       $\Pr(\Delta_i^{(t)} = 1) = \hat\gamma_i$, the responsibility from
       equation (8.42) evaluated at the current parameter values.

   (b) Set

$$\hat\mu_1 = \frac{\sum_{i=1}^{N}(1 - \Delta_i^{(t)})\cdot y_i}{\sum_{i=1}^{N}(1 - \Delta_i^{(t)})}, \qquad
\hat\mu_2 = \frac{\sum_{i=1}^{N}\Delta_i^{(t)}\cdot y_i}{\sum_{i=1}^{N}\Delta_i^{(t)}},$$

       and generate $\mu_1^{(t)} \sim N(\hat\mu_1, \hat\sigma_1^2)$ and
       $\mu_2^{(t)} \sim N(\hat\mu_2, \hat\sigma_2^2)$.

3. Continue step 2 until the joint distribution of
   $(\Delta^{(t)}, \mu_1^{(t)}, \mu_2^{(t)})$ doesn’t change.

FIGURE 8.8. Mixture example. (Left panel:) 200 values of the two mean
parameters from Gibbs sampling; horizontal lines are drawn at the maximum
likelihood estimates $\hat\mu_1$, $\hat\mu_2$. (Right panel:) Proportion of
values with $\Delta_i = 1$, for each of the 200 Gibbs sampling iterations; a
horizontal line is drawn at $\sum_i \hat\gamma_i/N$.
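Algorithm 8.4 can be sketched on the data of Table 8.1, holding the variances and mixing proportion at the maximum likelihood values reported in the text (0.87, 0.77 and 0.546). The starting means and the guard for an empty component are additions made for this illustration, not part of the algorithm as stated.

```python
import numpy as np

# Data of Table 8.1; variances and mixing proportion are held at the maximum
# likelihood values reported in the text.
y = np.array([-0.39, 0.12, 0.94, 1.67, 1.76, 2.44, 3.72, 4.28, 4.92, 5.53,
               0.06, 0.48, 1.01, 1.68, 1.80, 3.25, 4.12, 4.60, 5.28, 6.22])
VAR1, VAR2, PI = 0.87, 0.77, 0.546

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gibbs_mixture(y, mu1, mu2, n_iter, rng):
    means = np.empty((n_iter, 2))
    prop2 = np.empty(n_iter)
    for t in range(n_iter):
        # Step 2(a): draw memberships from the responsibilities (8.42).
        p1 = (1 - PI) * normal_pdf(y, mu1, VAR1)
        p2 = PI * normal_pdf(y, mu2, VAR2)
        delta = rng.random(y.size) < p2 / (p1 + p2)
        # Step 2(b): component means, then draw the new mu's around them.
        # Guard (an addition of this sketch): keep the old mean if a
        # component happens to receive no observations.
        m1 = y[~delta].mean() if (~delta).any() else mu1
        m2 = y[delta].mean() if delta.any() else mu2
        mu1 = rng.normal(m1, np.sqrt(VAR1))
        mu2 = rng.normal(m2, np.sqrt(VAR2))
        means[t] = mu1, mu2
        prop2[t] = delta.mean()
    return means, prop2

rng = np.random.default_rng(7)
means, prop2 = gibbs_mixture(y, mu1=4.0, mu2=1.0, n_iter=200, rng=rng)  # hypothetical start
```

The columns of `means` and the series `prop2` correspond to the quantities plotted in the two panels of Figure 8.8, although the exact traces depend on the random seed and starting values.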

Gibbs sampling is just one of a number of recently developed procedures 
for sampling from posterior distributions. It uses conditional sampling of 
each parameter given the rest, and is useful when the structure of the
problem makes this sampling easy to carry out. Other methods do not require
such structure, for example the Metropolis-Hastings algorithm. These and 
other computational Bayesian methods have been applied to sophisticated 
learning algorithms such as Gaussian process models and neural networks. 
Details may be found in the references given in the Bibliographic Notes at 
the end of this chapter. 


8.7 Bagging 

Earlier we introduced the bootstrap as a way of assessing the accuracy of a 
parameter estimate or a prediction. Here we show how to use the bootstrap 
to improve the estimate or prediction itself. In Section 8.4 we investigated 
the relationship between the bootstrap and Bayes approaches, and found 
that the bootstrap mean is approximately a posterior average. Bagging 
further exploits this connection. 

Consider first the regression problem. Suppose we fit a model to our
training data $Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, obtaining
the prediction $\hat f(x)$ at input $x$. Bootstrap aggregation or bagging
averages this prediction over a collection of bootstrap samples, thereby
reducing its variance. For each bootstrap sample $Z^{*b}$,
$b = 1, 2, \ldots, B$, we fit our model, giving prediction $\hat f^{*b}(x)$.
The bagging estimate is defined by

$$\hat f_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x). \tag{8.51}$$

Denote by $\hat{\mathcal P}$ the empirical distribution putting equal
probability $1/N$ on each of the data points $(x_i, y_i)$. In fact the
“true” bagging estimate is defined by $E_{\hat{\mathcal P}} \hat f^*(x)$,
where $Z^* = \{(x_1^*, y_1^*), (x_2^*, y_2^*), \ldots, (x_N^*, y_N^*)\}$ and
each $(x_i^*, y_i^*) \sim \hat{\mathcal P}$. Expression (8.51) is a Monte
Carlo estimate of the true bagging estimate, approaching it as
$B \to \infty$.

The bagged estimate (8.51) will differ from the original estimate $\hat f(x)$
only when the latter is a nonlinear or adaptive function of the data. For
example, to bag the B-spline smooth of Section 8.2.1, we average the curves
in the bottom left panel of Figure 8.2 at each value of $x$. The B-spline
smoother is linear in the data if we fix the inputs; hence if we sample using
the parametric bootstrap in equation (8.6), then
$\hat f_{\text{bag}}(x) \to \hat f(x)$ as $B \to \infty$ (Exercise 8.4). Hence
bagging just reproduces the original smooth in the top left panel of
Figure 8.2. The same is approximately true if we were to bag using the
nonparametric bootstrap.

A more interesting example is a regression tree, where $\hat f(x)$ denotes the
tree’s prediction at input vector $x$ (regression trees are described in
Chapter 9). Each bootstrap tree will typically involve different features than the
original, and might have a different number of terminal nodes. The bagged
estimate is the average prediction at $x$ from these $B$ trees.

Now suppose our tree produces a classifier $\hat G(x)$ for a $K$-class response.
Here it is useful to consider an underlying indicator-vector function $\hat f(x)$,
with value a single one and $K - 1$ zeroes, such that $\hat G(x) = \arg\max_k \hat f(x)$.
Then the bagged estimate $\hat f_{\rm bag}(x)$ (8.51) is a $K$-vector
$[p_1(x), p_2(x), \ldots, p_K(x)]$, with $p_k(x)$ equal to the proportion of trees
predicting class $k$ at $x$. The bagged classifier selects the class with the most
“votes” from the $B$ trees, $\hat G_{\rm bag}(x) = \arg\max_k \hat f_{\rm bag}(x)$.

Often we require the class-probability estimates at $x$, rather than the
classifications themselves. It is tempting to treat the voting proportions
$p_k(x)$ as estimates of these probabilities. A simple two-class example shows
that they fail in this regard. Suppose the true probability of class 1 at $x$ is
0.75, and each of the bagged classifiers accurately predicts a 1. Then $p_1(x) = 1$,
which is incorrect. For many classifiers $\hat G(x)$, however, there is already
an underlying function $\hat f(x)$ that estimates the class probabilities at $x$ (for
trees, the class proportions in the terminal node). An alternative bagging
strategy is to average these instead, rather than the vote indicator vectors.
Not only does this produce improved estimates of the class probabilities,
but it also tends to produce bagged classifiers with lower variance, especially
for small $B$ (see Figure 8.10 in the next example).
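
The difference between the two strategies can be seen in a few lines of Python. This is only a sketch: the function names are hypothetical, and the per-tree class probabilities are made-up numbers chosen to mimic the two-class example above, where every tree estimates the class-1 probability to be near 0.75.

```python
def vote_proportions(class_probs):
    """Proportion of trees voting for each class.

    class_probs holds, for each tree, its list of class-probability
    estimates at a fixed input x.
    """
    n_classes = len(class_probs[0])
    votes = [0] * n_classes
    for probs in class_probs:
        winner = max(range(n_classes), key=lambda k: probs[k])
        votes[winner] += 1
    return [v / len(class_probs) for v in votes]


def averaged_probabilities(class_probs):
    """Average the trees' class-probability estimates instead of their votes."""
    n_classes = len(class_probs[0])
    n_trees = len(class_probs)
    return [sum(probs[k] for probs in class_probs) / n_trees
            for k in range(n_classes)]


# Three illustrative trees, each listing [Pr(class 0), Pr(class 1)] at x.
tree_probs = [[0.3, 0.7], [0.2, 0.8], [0.25, 0.75]]
```

With these inputs every tree votes for class 1, so the voting proportion for class 1 is 1, while the averaged probability stays near 0.75.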


8.7.1 Example: Trees with Simulated Data 

We generated a sample of size $N = 30$, with two classes and $p = 5$ features,
each having a standard Gaussian distribution with pairwise correlation
0.95. The response $Y$ was generated according to $\Pr(Y = 1 | x_1 \le 0.5) = 0.2$,
$\Pr(Y = 1 | x_1 > 0.5) = 0.8$. The Bayes error is 0.2. A test sample of size 2000
was also generated from the same population. We fit classification trees to
the training sample and to each of 200 bootstrap samples (classification
trees are described in Chapter 9). No pruning was used. Figure 8.9 shows
the original tree and eleven bootstrap trees. Notice how the trees are all
different, with different splitting features and cutpoints. The test error for
the original tree and the bagged tree is shown in Figure 8.10. In this example
the trees have high variance due to the correlation in the predictors.
Bagging succeeds in smoothing out this variance and hence reducing the
test error.
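
For readers who wish to reproduce a dataset of this kind, the following Python sketch generates data with the stated structure. It is not the code used for the figures: the shared-factor construction of the equicorrelated Gaussian features, the function name and the seed are all assumptions made for illustration.

```python
import math
import random


def simulate(n, p=5, rho=0.95, seed=0):
    """Draw n observations with p equicorrelated standard Gaussian features.

    Each feature is sqrt(rho) * shared + sqrt(1 - rho) * own noise, which
    gives unit variance and pairwise correlation rho (an assumed construction).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        row = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(p)]
        # Pr(Y = 1) is 0.2 when x_1 <= 0.5 and 0.8 otherwise.
        prob_one = 0.2 if row[0] <= 0.5 else 0.8
        X.append(row)
        y.append(1 if rng.random() < prob_one else 0)
    return X, y


X_train, y_train = simulate(30)      # training sample, N = 30
X_test, y_test = simulate(2000, seed=1)  # test sample of size 2000
```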

FIGURE 8.9. Bagging trees on simulated dataset. The top left panel shows the
original tree. Eleven trees grown on bootstrap samples are shown. For each tree,
the top split is annotated.

FIGURE 8.10. Error curves for the bagging example of Figure 8.9. Shown is
the test error of the original tree and bagged trees as a function of the number of
bootstrap samples. The orange points correspond to the consensus vote, while the
green points average the probabilities.

Bagging can dramatically reduce the variance of unstable procedures
like trees, leading to improved prediction. A simple argument shows why
bagging helps under squared-error loss, in short because averaging reduces
variance and leaves bias unchanged.

Assume our training observations $(x_i, y_i)$, $i = 1, \ldots, N$ are independently
drawn from a distribution $\mathcal P$, and consider the ideal aggregate
estimator $f_{\rm ag}(x) = E_{\mathcal P} \hat f^*(x)$. Here $x$ is fixed and the bootstrap dataset $Z^*$
consists of observations $x_i^*, y_i^*$, $i = 1, 2, \ldots, N$ sampled from $\mathcal P$. Note that
$f_{\rm ag}(x)$ is a bagging estimate, drawing bootstrap samples from the actual
population $\mathcal P$ rather than the data. It is not an estimate that we can use
in practice, but is convenient for analysis. We can write

$$\begin{aligned}
E_{\mathcal P}[Y - \hat f^*(x)]^2 &= E_{\mathcal P}[Y - f_{\rm ag}(x) + f_{\rm ag}(x) - \hat f^*(x)]^2 \\
&= E_{\mathcal P}[Y - f_{\rm ag}(x)]^2 + E_{\mathcal P}[\hat f^*(x) - f_{\rm ag}(x)]^2 \\
&\ge E_{\mathcal P}[Y - f_{\rm ag}(x)]^2. \qquad (8.52)
\end{aligned}$$

The extra error on the right-hand side comes from the variance of $\hat f^*(x)$
around its mean $f_{\rm ag}(x)$. Therefore true population aggregation never
increases mean squared error. This suggests that bagging, which draws samples
from the training data, will often decrease mean-squared error.
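
The second equality in (8.52) relies on the cross term vanishing. A sketch of that step, under the assumption (implicit in the argument) that the new response $Y$ is independent of the sample used to build $\hat f^*(x)$:

```latex
\begin{aligned}
E_{\mathcal P}\bigl[(Y - f_{\rm ag}(x))\,(f_{\rm ag}(x) - \hat f^*(x))\bigr]
  &= E_{\mathcal P}[Y - f_{\rm ag}(x)] \cdot E_{\mathcal P}[f_{\rm ag}(x) - \hat f^*(x)] \\
  &= E_{\mathcal P}[Y - f_{\rm ag}(x)] \cdot \bigl(f_{\rm ag}(x) - E_{\mathcal P}\hat f^*(x)\bigr) \\
  &= 0,
\end{aligned}
```

since $f_{\rm ag}(x) = E_{\mathcal P} \hat f^*(x)$ by definition and $f_{\rm ag}(x)$ is a constant for fixed $x$.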

The above argument does not hold for classification under 0-1 loss, because
of the nonadditivity of bias and variance. In that setting, bagging a
good classifier can make it better, but bagging a bad classifier can make it
worse. Here is a simple example, using a randomized rule. Suppose $Y = 1$
for all $x$, and the classifier $\hat G(x)$ predicts $Y = 1$ (for all $x$) with probability
0.4 and predicts $Y = 0$ (for all $x$) with probability 0.6. Then the
misclassification error of $\hat G(x)$ is 0.6 but that of the bagged classifier is 1.0.

For classification we can understand the bagging effect in terms of a
consensus of independent weak learners (Dietterich, 2000a). Let the Bayes
optimal decision at $x$ be $G(x) = 1$ in a two-class example. Suppose each
of the weak learners $G_b^*$ has an error-rate $e_b = e < 0.5$, and let
$S_1(x) = \sum_{b=1}^{B} I(G_b^*(x) = 1)$ be the consensus vote for class 1. Since the weak
learners are assumed to be independent, $S_1(x) \sim \mathrm{Bin}(B, 1 - e)$, and
$\Pr(S_1 > B/2) \to 1$ as $B$ gets large. This concept has been popularized outside of
statistics as the “Wisdom of Crowds” (Surowiecki, 2004): the collective
knowledge of a diverse and independent body of people typically exceeds
the knowledge of any single individual, and can be harnessed by voting.
Of course, the main caveat here is “independent,” and bagged trees are
not. Figure 8.11 illustrates the power of a consensus vote in a simulated
example, where only 30% of the voters have some knowledge.
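
The binomial calculation behind this argument is easy to check numerically. The sketch below (the function name is our own) computes $\Pr(S_1 > B/2)$ exactly for independent voters who are each correct with probability `p_correct`. Taking `p_correct = 0.6` corresponds to an error rate $e = 0.4$; taking `p_correct = 0.4` mimics the randomized rule described earlier, under the idealization that the voters are independent.

```python
from math import comb


def majority_vote_accuracy(B, p_correct):
    """Pr(S > B/2) for S ~ Bin(B, p_correct): a strict majority votes correctly."""
    return sum(comb(B, s) * p_correct ** s * (1.0 - p_correct) ** (B - s)
               for s in range(B // 2 + 1, B + 1))


# Accuracy of the consensus as the number of independent voters grows.
for B in (1, 11, 101, 501):
    print(B, majority_vote_accuracy(B, 0.6), majority_vote_accuracy(B, 0.4))
```

For voters better than chance the consensus accuracy climbs toward 1, while for voters worse than chance it falls toward 0, which is why bagging a bad classifier can make it worse.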

In Chapter 15 we see how random forests improve on bagging by reducing 
the correlation between the sampled trees. 

Note that when we bag a model, any simple structure in the model is 
lost. As an example, a bagged tree is no longer a tree. For interpretation 
of the model this is clearly a drawback. More stable procedures like nearest
neighbors are typically not affected much by bagging. Unfortunately,
the unstable models most helped by bagging are unstable because of the 
emphasis on interpretability, and this is lost in the bagging process. 

Figure 8.12 shows an example where bagging doesn’t help. The 100 data
points shown have two features and two classes, separated by the gray
linear boundary $x_1 + x_2 = 1$. We choose as our classifier $\hat G(x)$ a single
axis-oriented split, choosing the split along either $x_1$ or $x_2$ that produces
the largest decrease in training misclassification error.

The decision boundary obtained from bagging the 0-1 decision rule over 
B = 50 bootstrap samples is shown by the blue curve in the left panel. 
It does a poor job of capturing the true boundary. The single split rule,
derived from the training data, splits near 0 (the middle of the range of $x_1$
or $x_2$), and hence has little contribution away from the center. Averaging
the probabilities rather than the classifications does not help here. Bagging 
estimates the expected class probabilities from the single split rule, that is, 
averaged over many replications. Note that the expected class probabilities 
computed by bagging cannot be realized on any single replication, in the 
same way that a woman cannot have 2.4 children. In this sense, bagging 
increases somewhat the space of models of the individual base classifier. 
However, it doesn’t help in this and many other examples where a greater 
enlargement of the model class is needed. “Boosting” is a way of doing this
and is described in Chapter 10. The decision boundary in the right panel of
Figure 8.12 is the result of the boosting procedure, and it roughly captures
the diagonal boundary.

[Figure 8.11. Axis label: P - Probability of Informed Person Being Correct.]

FIGURE 8.11. Simulated academy awards voting. 50 members vote in 10 categories,
each with 4 nominations. For any category, only 15 voters have some
knowledge, represented by their probability of selecting the “correct” candidate in
that category (so P = 0.25 means they have no knowledge). For each category, the
15 experts are chosen at random from the 50. Results show the expected correct
(based on 50 simulations) for the consensus, as well as for the individuals. The
error bars indicate one standard deviation. We see, for example, that if the 15
informed for a category have a 50% chance of selecting the correct candidate, the
consensus doubles the expected performance of an individual.

[Figure 8.12. Panel titles: Bagged Decision Rule (left), Boosted Decision Rule (right).]

FIGURE 8.12. Data with two features and two classes, separated by a linear
boundary. (Left panel:) Decision boundary estimated from bagging the decision
rule from a single split, axis-oriented classifier. (Right panel:) Decision boundary
from boosting the decision rule of the same classifier. The test error rates are
0.166 and 0.065, respectively. Boosting is described in Chapter 10.


8.8 Model Averaging and Stacking 

In Section 8.4 we viewed bootstrap values of an estimator as approximate
posterior values of a corresponding parameter, from a kind of nonparametric
Bayesian analysis. Viewed in this way, the bagged estimate (8.51) is
an approximate posterior Bayesian mean. In contrast, the training sample
estimate $\hat f(x)$ corresponds to the mode of the posterior. Since the posterior
mean (not mode) minimizes squared-error loss, it is not surprising that
bagging can often reduce mean squared-error.

Here we discuss Bayesian model averaging more generally. We have a
set of candidate models $\mathcal{M}_m$, $m = 1, \ldots, M$ for our training set $Z$. These
models may be of the same type with different parameter values (e.g.,
subsets in linear regression), or different models for the same task (e.g.,
neural networks and regression trees).

Suppose $\zeta$ is some quantity of interest, for example, a prediction $f(x)$ at
some fixed feature value $x$. The posterior distribution of $\zeta$ is

$$\Pr(\zeta | Z) = \sum_{m=1}^{M} \Pr(\zeta | \mathcal{M}_m, Z) \Pr(\mathcal{M}_m | Z), \qquad (8.53)$$

with posterior mean

$$E(\zeta | Z) = \sum_{m=1}^{M} E(\zeta | \mathcal{M}_m, Z) \Pr(\mathcal{M}_m | Z). \qquad (8.54)$$

This Bayesian prediction is a weighted average of the individual predictions, 
with weights proportional to the posterior probability of each model. 

This formulation leads to a number of different model-averaging strategies.
Committee methods take a simple unweighted average of the predictions
from each model, essentially giving equal probability to each model.
More ambitiously, the development in Section 7.7 shows the BIC criterion
can be used to estimate posterior model probabilities. This is applicable
in cases where the different models arise from the same parametric model,
with different parameter values. The BIC gives weight to each model
depending on how well it fits and how many parameters it uses. One can also
carry out the Bayesian recipe in full. If each model $\mathcal{M}_m$ has parameters
$\theta_m$, we write

$$\begin{aligned}
\Pr(\mathcal{M}_m | Z) &\propto \Pr(\mathcal{M}_m) \cdot \Pr(Z | \mathcal{M}_m) \\
&\propto \Pr(\mathcal{M}_m) \cdot \int \Pr(Z | \theta_m, \mathcal{M}_m) \Pr(\theta_m | \mathcal{M}_m)\, d\theta_m. \qquad (8.55)
\end{aligned}$$

In principle one can specify priors $\Pr(\theta_m | \mathcal{M}_m)$ and numerically compute
the posterior probabilities from (8.55), to be used as model-averaging
weights. However, we have seen no real evidence that this is worth all of
the effort, relative to the much simpler BIC approximation.
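
A minimal sketch of the BIC approximation in Python follows. It assumes the weights of Section 7.7, in which the posterior probability of model $m$ is estimated by $e^{-\frac{1}{2}\mathrm{BIC}_m} / \sum_{\ell} e^{-\frac{1}{2}\mathrm{BIC}_\ell}$; the function names and the numbers in the example are hypothetical.

```python
import math


def bic_weights(bics):
    """Approximate posterior model probabilities from a list of BIC values."""
    best = min(bics)
    # Subtracting the smallest BIC leaves the ratios unchanged and avoids underflow.
    unnormalized = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def model_average(predictions, weights):
    """Weighted average of the individual model predictions, as in (8.54)."""
    return sum(w * f for w, f in zip(weights, predictions))


# Illustrative numbers only: three models, their BIC values and predictions at x.
weights = bic_weights([100.0, 102.0, 110.0])
prediction = model_average([1.2, 1.5, 0.9], weights)
```

A committee method corresponds to replacing `weights` by equal weights $1/M$.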

How can we approach model averaging from a frequentist viewpoint?
Given predictions $\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)$, under squared-error loss, we can
seek the weights $w = (w_1, w_2, \ldots, w_M)$ such that

$$\hat w = \arg\min_w E_{\mathcal P}\Bigl[Y - \sum_{m=1}^{M} w_m \hat f_m(x)\Bigr]^2. \qquad (8.56)$$

Here the input value $x$ is fixed and the $N$ observations in the dataset $Z$ (and
the target $Y$) are distributed according to $\mathcal P$. The solution is the population
linear regression of $Y$ on $\hat F(x)^T \equiv [\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)]$:

$$\hat w = E_{\mathcal P}[\hat F(x) \hat F(x)^T]^{-1} E_{\mathcal P}[\hat F(x) Y]. \qquad (8.57)$$

Now the full regression has smaller error than any single model

$$E_{\mathcal P}\Bigl[Y - \sum_{m=1}^{M} \hat w_m \hat f_m(x)\Bigr]^2 \le E_{\mathcal P}\bigl[Y - \hat f_m(x)\bigr]^2 \quad \forall m \qquad (8.58)$$

so combining models never makes things worse, at the population level.

Of course the population linear regression (8.57) is not available, and it
is natural to replace it with the linear regression over the training set. But
there are simple examples where this does not work well. For example, if
$\hat f_m(x)$, $m = 1, 2, \ldots, M$ represent the prediction from the best subset of
inputs of size $m$ among $M$ total inputs, then linear regression would put all
of the weight on the largest model, that is, $\hat w_M = 1$, $\hat w_m = 0$, $m < M$. The
problem is that we have not put each of the models on the same footing
by taking into account their complexity (the number of inputs $m$ in this
example).

Stacked generalization, or stacking, is a way of doing this. Let $\hat f_m^{-i}(x)$
be the prediction at $x$, using model $m$, applied to the dataset with the
$i$th training observation removed. The stacking estimate of the weights is
obtained from the least squares linear regression of $y_i$ on $\hat f_m^{-i}(x_i)$,
$m = 1, 2, \ldots, M$. In detail the stacking weights are given by

$$\hat w^{\rm st} = \arg\min_w \sum_{i=1}^{N} \Bigl[y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i)\Bigr]^2. \qquad (8.59)$$

The final prediction is $\sum_m \hat w_m^{\rm st} \hat f_m(x)$. By using the cross-validated
predictions $\hat f_m^{-i}(x)$, stacking avoids giving unfairly high weight to models with
higher complexity. Better results can be obtained by restricting the weights
to be nonnegative, and to sum to 1. This seems like a reasonable restriction
if we interpret the weights as posterior model probabilities as in equation
(8.54), and it leads to a tractable quadratic programming problem.
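
The unconstrained version of (8.59) can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the two candidate models (`fit_mean` and `fit_line`), the helper `solve`, and the use of the normal equations. The nonnegativity and sum-to-one restrictions mentioned above are not imposed, since they would require a quadratic programming solver.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (aug[r][n] - sum(aug[r][k] * w[k] for k in range(r + 1, n))) / aug[r][r]
    return w


def fit_mean(data):
    """Candidate model 1: predict the mean response."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m


def fit_line(data):
    """Candidate model 2: simple least squares line."""
    n = len(data)
    xbar = sum(x for x, _ in data) / n
    ybar = sum(y for _, y in data) / n
    sxx = sum((x - xbar) ** 2 for x, _ in data)
    slope = sum((x - xbar) * (y - ybar) for x, y in data) / sxx
    return lambda x: ybar + slope * (x - xbar)


def stacking_weights(data, fitters):
    """Least squares regression of y_i on leave-one-out predictions, as in (8.59)."""
    rows = []
    for i, (xi, _) in enumerate(data):
        held_out = data[:i] + data[i + 1:]          # remove the ith observation
        rows.append([fit(held_out)(xi) for fit in fitters])
    n_models = len(fitters)
    gram = [[sum(r[a] * r[b] for r in rows) for b in range(n_models)]
            for a in range(n_models)]
    rhs = [sum(r[a] * y for r, (_, y) in zip(rows, data)) for a in range(n_models)]
    return solve(gram, rhs)
```

On data that lie exactly on a line, the leave-one-out predictions of the line model reproduce the responses, so the weights come out as essentially 0 for the mean model and 1 for the line model.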

There is a close connection between stacking and model selection via
leave-one-out cross-validation (Section 7.10). If we restrict the minimization
in (8.59) to weight vectors $w$ that have one unit weight and the rest zero,
this leads to a model choice $\hat m$ with smallest leave-one-out cross-validation
error. Rather than choose a single model, stacking combines them with
estimated optimal weights. This will often lead to better prediction, but
less interpretability than the choice of only one of the $M$ models.

The stacking idea is actually more general than described above. One 
can use any learning method, not just linear regression, to combine the 
models as in (8.59); the weights could also depend on the input location 
x. In this way, learning methods are “stacked” on top of one another, to 
improve prediction performance. 


8.9 Stochastic Search: Bumping 

The final method described in this chapter does not involve averaging or
combining models, but rather is a technique for finding a better single
model. Bumping uses bootstrap sampling to move randomly through model
space. For problems where the fitting method finds many local minima,
bumping can help the method to avoid getting stuck in poor solutions.






[Figure 8.13. Panel titles: Regular 4-Node Tree (left), Bumped 4-Node Tree (right).]

FIGURE 8.13. Data with two features and two classes (blue and orange),
displaying a pure interaction. The left panel shows the partition found by three splits
of a standard, greedy, tree-growing algorithm. The vertical grey line near the left
edge is the first split, and the broken lines are the two subsequent splits. The
algorithm has no idea where to make a good initial split, and makes a poor choice.
The right panel shows the near-optimal splits found by bumping the tree-growing
algorithm 20 times.

As in bagging, we draw bootstrap samples and fit a model to each. But
rather than average the predictions, we choose the model estimated from a
bootstrap sample that best fits the training data. In detail, we draw bootstrap
samples $Z^{*1}, \ldots, Z^{*B}$ and fit our model to each, giving predictions
$\hat f^{*b}(x)$, $b = 1, 2, \ldots, B$ at input point $x$. We then choose the model that
produces the smallest prediction error, averaged over the original training
set. For squared error, for example, we choose the model obtained from
bootstrap sample $\hat b$, where

$$\hat b = \arg\min_b \sum_{i=1}^{N} [y_i - \hat f^{*b}(x_i)]^2. \qquad (8.60)$$

The corresponding model predictions are $\hat f^{*\hat b}(x)$. By convention we also
include the original training sample in the set of bootstrap samples, so that
the method is free to pick the original model if it has the lowest training
error.
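
The selection rule (8.60) can be sketched as follows. The base learner `fit_median_split` is a deliberately crude, hypothetical procedure (it splits at the sample median of the input regardless of the response), chosen only so that perturbing the data moves the fitted model; the function names and the toy data are likewise assumptions.

```python
import random


def fit_median_split(data):
    """Hypothetical base learner: split at the sample median of x and
    predict the mean response on each side."""
    xs = sorted(x for x, _ in data)
    cut = xs[len(xs) // 2]
    left = [y for x, y in data if x < cut]
    right = [y for x, y in data if x >= cut]
    overall = sum(y for _, y in data) / len(data)
    ml = sum(left) / len(left) if left else overall
    mr = sum(right) / len(right) if right else overall
    return lambda x: ml if x < cut else mr


def training_error(model, data):
    """Squared error of a fitted model over the original training set."""
    return sum((y - model(x)) ** 2 for x, y in data)


def bump(data, fit, B=20, seed=0):
    """Return the candidate model with the smallest training error, as in (8.60)."""
    rng = random.Random(seed)
    n = len(data)
    candidates = [fit(data)]  # the original sample is included by convention
    for _ in range(B):
        boot = [data[rng.randrange(n)] for _ in range(n)]
        candidates.append(fit(boot))
    return min(candidates, key=lambda m: training_error(m, data))


# Toy data: a step at x = 0.3, which the median split at 0.5 misses.
toy = [(i / 20.0, 0.0 if i < 6 else 1.0) for i in range(20)]
bumped = bump(toy, fit_median_split)
```

Because the original fit is among the candidates, the bumped model can never have larger training error than the original.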

By perturbing the data, bumping tries to move the fitting procedure
around to good areas of model space. For example, if a few data points are
causing the procedure to find a poor solution, any bootstrap sample that
omits those data points should produce a better solution.

For another example, consider the classification data in Figure 8.13, the
notorious exclusive or (XOR) problem. There are two classes (blue and
orange) and two input features, with the features exhibiting a pure
interaction. By splitting the data at $x_1 = 0$ and then splitting each resulting
stratum at $x_2 = 0$ (or vice versa), a tree-based classifier could achieve
perfect discrimination. However, the greedy, short-sighted CART algorithm
(Section 9.2) tries to find the best split on either feature, and then splits
the resulting strata. Because of the balanced nature of the data, all initial
splits on $x_1$ or $x_2$ appear to be useless, and the procedure essentially
generates a random split at the top level. The actual split found for these data is
shown in the left panel of Figure 8.13. By bootstrap sampling from the data,
bumping breaks the balance in the classes, and with a reasonable number
of bootstrap samples (here 20), it will by chance produce at least one tree
with initial split near either $x_1 = 0$ or $x_2 = 0$. Using just 20 bootstrap
samples, bumping found the near optimal splits shown in the right panel
of Figure 8.13. This shortcoming of the greedy tree-growing algorithm is
exacerbated if we add a number of noise features that are independent of
the class label. Then the tree-growing algorithm cannot distinguish $x_1$ or
$x_2$ from the others, and gets seriously lost.

Since bumping compares different models on the training data, one must 
ensure that the models have roughly the same complexity. In the case of 
trees, this would mean growing trees with the same number of terminal 
nodes on each bootstrap sample. Bumping can also help in problems where 
it is difficult to optimize the fitting criterion, perhaps because of a lack of 
smoothness. The trick is to optimize a different, more convenient criterion 
over the bootstrap samples, and then choose the model producing the best 
results for the desired criterion on the training sample.

a joint maximization scheme for a penalized complete-data log-likelihood

Breiman (1996b) contains an accessible discussion for statisticians. Leblanc
and Tibshirani (1996) describe variations on stacking based on the bootstrap.
Model averaging in the Bayesian framework has been recently advocated
by Madigan and Raftery (1994). Bumping was proposed by Tibshirani
and Knight (1999).


Exercises 


Ex. 8.1 Let $r(y)$ and $q(y)$ be probability density functions. Jensen’s
inequality states that for a random variable $X$ and a convex function $\phi(x)$,
$E[\phi(X)] \ge \phi[E(X)]$. Use Jensen’s inequality to show that

$$E_q \log[r(Y)/q(Y)] \qquad (8.61)$$

is maximized as a function of $r(y)$ when $r(y) = q(y)$. Hence show that
$R(\theta, \theta) \ge R(\theta', \theta)$ as stated below equation (8.46).

Ex. 8.2 Consider the maximization of the log-likelihood (8.48), over
distributions $\tilde P(Z^m)$ such that $\tilde P(Z^m) \ge 0$ and $\sum_{Z^m} \tilde P(Z^m) = 1$. Use
Lagrange multipliers to show that the solution is the conditional distribution
$\tilde P(Z^m) = \Pr(Z^m | Z, \theta')$, as in (8.49).

Ex. 8.3 Justify the estimate (8.50), using the relationship

$$\Pr(A) = \int \Pr(A | B)\, d(\Pr(B)).$$

Ex. 8.4 Consider the bagging method of Section 8.7. Let our estimate $\hat f(x)$
be the B-spline smoother $\hat\mu(x)$ of Section 8.2.1. Consider the parametric
bootstrap of equation (8.6), applied to this estimator. Show that if we bag
$\hat f(x)$, using the parametric bootstrap to generate the bootstrap samples,
the bagging estimate $\hat f_{\rm bag}(x)$ converges to the original estimate $\hat f(x)$ as
$B \to \infty$.


Ex. 8.5 Suggest generalizations of each of the loss functions in Figure 10.4 
to more than two classes, and design an appropriate plot to compare them. 

Ex. 8.6 Consider the bone mineral density data of Figure 5.6.

(a) Fit a cubic smooth spline to the relative change in spinal BMD, as a
function of age. Use cross-validation to estimate the optimal amount
of smoothing. Construct pointwise 90% confidence bands for the
underlying function.

(b) Compute the posterior mean and covariance for the true function via
(8.28), and compare the posterior bands to those obtained in (a).

(c) Compute 100 bootstrap replicates of the fitted curves, as in the bottom
left panel of Figure 8.2. Compare the results to those obtained in (a)
and (b).

Ex. 8.7 EM as a minorization algorithm (Hunter and Lange, 2004; Wu and
Lange, 2007). A function $g(x, y)$ is said to minorize a function $f(x)$ if

$$g(x, y) \le f(x), \quad g(x, x) = f(x) \qquad (8.62)$$

for all $x, y$ in the domain. This is useful for maximizing $f(x)$ since it is easy
to show that $f(x)$ is non-decreasing under the update

$$x^{s+1} = \arg\max_x g(x, x^s). \qquad (8.63)$$

There are analogous definitions for majorization, for minimizing a function
$f(x)$. The resulting algorithms are known as MM algorithms, for
“Minorize-Maximize” or “Majorize-Minimize.”

Show that the EM algorithm (Section 8.5.2) is an example of an MM
algorithm, using $Q(\theta', \theta) + \log \Pr(Z | \theta) - Q(\theta, \theta)$ to minorize the observed data
log-likelihood $\ell(\theta'; Z)$. (Note that only the first term involves the relevant
parameter $\theta'$).

9

Additive Models, Trees, and Related Methods


In this chapter we begin our discussion of some specific methods for supervised
learning. These techniques each assume a (different) structured form
for the unknown regression function, and by doing so they finesse the curse
of dimensionality. Of course, they pay the possible price of misspecifying
the model, and so in each case there is a tradeoff that has to be made. They
take off where Chapters 3-6 left off. We describe five related techniques:
generalized additive models, trees, multivariate adaptive regression splines,
the patient rule induction method, and hierarchical mixtures of experts.


9.1 Generalized Additive Models 

Regression models play an important role in many data analyses, providing
prediction and classification rules, and data analytic tools for understanding
the importance of different inputs.

Although attractively simple, the traditional linear model often fails in 
these situations: in real life, effects are often not linear. In earlier chapters 
we described techniques that used predefined basis functions to achieve 
nonlinearities. This section describes more automatic flexible statistical 
methods that may be used to identify and characterize nonlinear regression 
effects. These methods are called “generalized additive models.” 

In the regression setting, a generalized additive model has the form

$$E(Y | X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p). \qquad (9.1)$$

As usual $X_1, X_2, \ldots, X_p$ represent predictors and $Y$ is the outcome; the $f_j$’s
are unspecified smooth (“nonparametric”) functions. If we were to model
each function using an expansion of basis functions (as in Chapter 5), the
resulting model could then be fit by simple least squares. Our approach
here is different: we fit each function using a scatterplot smoother (e.g., a
cubic smoothing spline or kernel smoother), and provide an algorithm for
simultaneously estimating all $p$ functions (Section 9.1.1).

For two-class classification, recall the logistic regression model for binary
data discussed in Section 4.4. We relate the mean of the binary response
$\mu(X) = \Pr(Y = 1 | X)$ to the predictors via a linear regression model and
the logit link function:

$$\log\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p. \qquad (9.2)$$

The additive logistic regression model replaces each linear term by a more
general functional form

$$\log\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + f_1(X_1) + \cdots + f_p(X_p), \qquad (9.3)$$

where again each $f_j$ is an unspecified smooth function. While the
nonparametric form for the functions $f_j$ makes the model more flexible, the
additivity is retained and allows us to interpret the model in much the
same way as before. The additive logistic regression model is an example
of a generalized additive model. In general, the conditional mean $\mu(X)$ of
a response $Y$ is related to an additive function of the predictors via a link
function $g$:

$$g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p). \qquad (9.4)$$

Examples of classical link functions are the following: 

• $g(\mu) = \mu$ is the identity link, used for linear and additive models for
Gaussian response data.

• $g(\mu) = \mathrm{logit}(\mu)$ as above, or $g(\mu) = \mathrm{probit}(\mu)$, the probit link function,
for modeling binomial probabilities. The probit function is the inverse
Gaussian cumulative distribution function: $\mathrm{probit}(\mu) = \Phi^{-1}(\mu)$.

• $g(\mu) = \log(\mu)$ for log-linear or log-additive models for Poisson count
data.

All three of these arise from exponential family sampling models, which
in addition include the gamma and negative-binomial distributions. These
families generate the well-known class of generalized linear models, which
are all extended in the same way to generalized additive models.
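
For concreteness, the classical links listed above and their inverses can be written down directly in Python using only the standard library. The dictionary `LINKS` and its layout are our own choices for this sketch.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian, used for the probit link

# Each entry maps a link name to the pair (g, g_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "log": (math.log, math.exp),
}

# Map a mean of 0.3 to the additive-predictor scale and back again.
for name, (g, g_inv) in LINKS.items():
    eta = g(0.3)
    print(name, eta, g_inv(eta))
```

In each case the additive predictor $\alpha + f_1(X_1) + \cdots + f_p(X_p)$ lives on the scale of $g(\mu)$, and the inverse link maps it back to the mean.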

The functions $f_j$ are estimated in a flexible manner, using an algorithm
whose basic building block is a scatterplot smoother. The estimated function
$\hat f_j$ can then reveal possible nonlinearities in the effect of $X_j$. Not all
of the functions $f_j$ need to be nonlinear. We can easily mix in linear and
other parametric forms with the nonlinear terms, a necessity when some of
the inputs are qualitative variables (factors). The nonlinear terms are not
restricted to main effects either; we can have nonlinear components in two
or more variables, or separate curves in $X_j$ for each level of the factor $X_k$.
Thus each of the following would qualify:

• $g(\mu) = X^T \beta + \alpha_k + f(Z)$, a semiparametric model, where $X$ is a
vector of predictors to be modeled linearly, $\alpha_k$ the effect for the $k$th
level of a qualitative input $V$, and the effect of predictor $Z$ is modeled
nonparametrically.

• $g(\mu) = f(X) + g_k(Z)$, where again $k$ indexes the levels of a qualitative
input $V$, and thus creates an interaction term $g(V, Z) = g_k(Z)$ for
the effect of $V$ and $Z$.

• $g(\mu) = f(X) + g(Z, W)$ where $g$ is a nonparametric function in two
features.

Additive models can replace linear models in a wide variety of settings,
for example an additive decomposition of time series,

$$Y_t = S_t + T_t + \varepsilon_t, \qquad (9.5)$$

where $S_t$ is a seasonal component, $T_t$ is a trend and $\varepsilon$ is an error term.

9.1.1 Fitting Additive Models 

In this section we describe a modular algorithm for fitting additive models
and their generalizations. The building block is the scatterplot smoother
for fitting nonlinear effects in a flexible way. For concreteness we use as our
scatterplot smoother the cubic smoothing spline described in Chapter 5.
The additive model has the form

$$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \qquad (9.6)$$

where the error term $\varepsilon$ has mean zero. Given observations $x_i, y_i$, a criterion
like the penalized sum of squares (5.9) of Section 5.4 can be specified for
this problem,

$$\mathrm{PRSS}(\alpha, f_1, f_2, \ldots, f_p) = \sum_{i=1}^{N} \Bigl(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Bigr)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2\, dt_j, \qquad (9.7)$$

where the $\lambda_j \ge 0$ are tuning parameters. It can be shown that the minimizer
of (9.7) is an additive cubic spline model; each of the functions $f_j$ is a
cubic spline in the component $X_j$, with knots at each of the unique values
of $x_{ij}$, $i = 1, \ldots, N$. However, without further restrictions on the model,
the solution is not unique. The constant $\alpha$ is not identifiable, since we
can add or subtract any constants to each of the functions $f_j$, and adjust
$\alpha$ accordingly. The standard convention is to assume that
$\sum_{i=1}^{N} f_j(x_{ij}) = 0 \ \forall j$, that is, the functions average zero over the data.
It is easily seen that $\hat\alpha = \mathrm{ave}(y_i)$ in this case. If in addition to this
restriction, the matrix of input values (having $ij$th entry $x_{ij}$) has full column
rank, then (9.7) is a strictly convex criterion and the minimizer is unique.
If the matrix is singular, then the linear part of the components $f_j$ cannot be
uniquely determined (while the nonlinear parts can!) (Buja et al., 1989).

Algorithm 9.1 The Backfitting Algorithm for Additive Models.

1. Initialize: $\hat\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i$, $\hat f_j \equiv 0$, $\forall i, j$.

2. Cycle: $j = 1, 2, \ldots, p, \ldots, 1, 2, \ldots, p, \ldots,$

$$\hat f_j \leftarrow S_j\Bigl[\bigl\{y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik})\bigr\}_1^N\Bigr],$$

$$\hat f_j \leftarrow \hat f_j - \frac{1}{N} \sum_{i=1}^{N} \hat f_j(x_{ij}).$$

until the functions $\hat f_j$ change less than a prespecified threshold.

Furthermore, a simple iterative procedure exists for finding the solution.
We set $\hat\alpha = \mathrm{ave}(y_i)$, and it never changes. We apply a cubic smoothing
spline $S_j$ to the targets $\{y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik})\}_1^N$, as a function of $x_{ij}$,
to obtain a new estimate $\hat f_j$. This is done for each predictor in turn, using
the current estimates of the other functions $\hat f_k$ when computing
$y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik})$. The process is continued until the estimates $\hat f_j$ stabilize. This
procedure, given in detail in Algorithm 9.1, is known as “backfitting” and
the resulting fit is analogous to a multiple regression for linear models.

In principle, the second step in (2) of Algorithm 9.1 is not needed, since
the smoothing spline fit to a mean-zero response has mean zero
(Exercise 9.1). In practice, machine rounding can cause slippage, and the
adjustment is advised.
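
The backfitting cycle can be sketched compactly in Python. To keep the example self-contained we substitute a univariate least squares line for the cubic smoothing spline; this `linear_smoother` is an assumption made for illustration, and with it backfitting simply reproduces the multiple linear regression fit. Any other smoother with the same signature (inputs, partial residuals, fitted values at the inputs) could be passed in its place.

```python
def linear_smoother(x, r):
    """Stand-in smoother: least squares line of r on x, evaluated at x."""
    n = len(x)
    xbar = sum(x) / n
    rbar = sum(r) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxr = sum((xi - xbar) * (ri - rbar) for xi, ri in zip(x, r))
    slope = sxr / sxx if sxx > 0 else 0.0
    return [rbar + slope * (xi - xbar) for xi in x]


def backfit(X, y, smoother=linear_smoother, tol=1e-10, max_iter=1000):
    """Backfitting for an additive model; returns alpha and fitted f_j values."""
    n, p = len(y), len(X[0])
    alpha = sum(y) / n                       # set once and never changed
    f = [[0.0] * n for _ in range(p)]        # f[j][i] holds f_j(x_ij)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # Partial residuals: remove alpha and all other fitted functions.
            partial = [y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            new = smoother([row[j] for row in X], partial)
            mean_new = sum(new) / n
            new = [v - mean_new for v in new]   # recentering step
            max_change = max(max_change,
                             max(abs(a - b) for a, b in zip(new, f[j])))
            f[j] = new
        if max_change < tol:
            break
    return alpha, f
```

The fitted value for observation $i$ is `alpha + sum(f[j][i] for j in range(p))`.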

This same algorithm can accommodate other fitting methods in exactly
the same way, by specifying appropriate smoothing operators $S_j$:

• other univariate regression smoothers such as local polynomial
regression and kernel methods;

• linear regression operators yielding polynomial fits, piecewise
constant fits, parametric spline fits, series and Fourier fits;

• more complicated operators such as surface smoothers for second or
higher-order interactions or periodic smoothers for seasonal effects.

If we consider the operation of smoother $S_j$ only at the training points, it
can be represented by an $N \times N$ operator matrix $\mathbf{S}_j$ (see Section 5.4.1).
Then the degrees of freedom for the $j$th term are (approximately) computed
as $\mathrm{df}_j = \mathrm{trace}[\mathbf{S}_j] - 1$, by analogy with degrees of freedom for smoothers
discussed in Chapters 5 and 6.

For a large class of linear smoothers Sj, backfitting is equivalent to a 
Gauss-Seidel algorithm for solving a certain linear system of equations. 
Details are given in Exercise 9.2. 

For the logistic regression model and other generalized additive models, 
the appropriate criterion is a penalized log-likelihood. To maximize it, the 
backfitting procedure is used in conjunction with a likelihood maximizer. 
The usual Newton-Raphson routine for maximizing log-likelihoods in generalized
linear models can be recast as an IRLS (iteratively reweighted
least squares) algorithm. This involves repeatedly fitting a weighted linear
regression of a working response variable on the covariates; each regression
yields a new value of the parameter estimates, which in turn give new working
responses and weights, and the process is iterated (see Section 4.4.1).
In the generalized additive model, the weighted linear regression is simply 
replaced by a weighted backfitting algorithm. We describe the algorithm in 
more detail for logistic regression below, and more generally in Chapter 6 
of Hastie and Tibshirani (1990). 

9.1.2 Example: Additive Logistic Regression 

Probably the most widely used model in medical research is the logistic 
model for binary data. In this model the outcome Y can be coded as 0 
or 1, with 1 indicating an event (like death or relapse of a disease) and 
0 indicating no event. We wish to model $\Pr(Y = 1|X)$, the probability of
an event given values of the prognostic factors $X^T = (X_1, \ldots, X_p)$. The
goal is usually to understand the roles of the prognostic factors, rather
than to classify new individuals. Logistic models are also used in applications
where one is interested in estimating the class probabilities, for use
in risk screening. Apart from medical applications, credit risk screening is 
a popular application. 

The generalized additive logistic model has the form 

$$\log \frac{\Pr(Y = 1|X)}{\Pr(Y = 0|X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p). \qquad (9.8)$$

The functions $f_1, f_2, \ldots, f_p$ are estimated by a backfitting algorithm
within a Newton-Raphson procedure, shown in Algorithm 9.2.
Algorithm 9.2 Local Scoring Algorithm for the Additive Logistic Regression
Model.

1. Compute starting values: $\hat\alpha = \log[\bar y/(1 - \bar y)]$, where $\bar y = \mathrm{ave}(y_i)$, the
sample proportion of ones, and set $\hat f_j \equiv 0\ \forall j$.

2. Define $\hat\eta_i = \hat\alpha + \sum_j \hat f_j(x_{ij})$ and $\hat p_i = 1/[1 + \exp(-\hat\eta_i)]$.

Iterate:

(a) Construct the working target variable

$$z_i = \hat\eta_i + \frac{(y_i - \hat p_i)}{\hat p_i(1 - \hat p_i)}.$$

(b) Construct weights $w_i = \hat p_i(1 - \hat p_i)$.

(c) Fit an additive model to the targets $z_i$ with weights $w_i$, using
a weighted backfitting algorithm. This gives new estimates
$\hat\alpha, \hat f_j,\ \forall j$.

3. Continue step 2 until the change in the functions falls below a
prespecified threshold.
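The steps of Algorithm 9.2 can be sketched as follows. This is an illustrative sketch under several assumptions rather than a reference implementation: a weighted piecewise-constant bin smoother stands in for a weighted smoothing spline, the inner backfitting runs a fixed number of sweeps, and the clamping of $\hat\eta_i$ and $\hat p_i$ is a numerical guard that is not part of the algorithm as stated. All function names, parameters and the simulated data are hypothetical.

```python
import math
import random


def weighted_bin_smoother(x, r, w, n_bins=5):
    """Weighted average of r within equal-width bins of x (assumed smoother)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((xi - lo) / width), n_bins - 1) for xi in x]
    num, den = [0.0] * n_bins, [0.0] * n_bins
    for b, ri, wi in zip(idx, r, w):
        num[b] += wi * ri
        den[b] += wi
    return [num[b] / den[b] for b in idx]


def weighted_backfit(cols, z, w, f, sweeps=10):
    """Step 2(c): weighted backfitting of the working targets z."""
    n, p, wsum = len(z), len(cols), sum(w)
    alpha = sum(wi * zi for wi, zi in zip(w, z)) / wsum
    for _ in range(sweeps):
        for j in range(p):
            partial = [z[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            new = weighted_bin_smoother(cols[j], partial, w)
            mean = sum(wi * v for wi, v in zip(w, new)) / wsum
            f[j] = [v - mean for v in new]
    return alpha, f


def probabilities(alpha, f, eps=1e-5):
    """eta_i and p_i of step 2, clamped for numerical safety (an assumption)."""
    n = len(f[0])
    eta = [max(-30.0, min(30.0, alpha + sum(fj[i] for fj in f))) for i in range(n)]
    prob = [min(max(1.0 / (1.0 + math.exp(-e)), eps), 1.0 - eps) for e in eta]
    return eta, prob


def local_scoring(X, y, max_iter=20, tol=1e-6):
    n, p = len(y), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    ybar = sum(y) / n
    alpha = math.log(ybar / (1.0 - ybar))            # step 1
    f = [[0.0] * n for _ in range(p)]
    for _ in range(max_iter):
        eta, prob = probabilities(alpha, f)          # step 2
        z = [e + (yi - pi) / (pi * (1.0 - pi)) for e, yi, pi in zip(eta, y, prob)]
        w = [pi * (1.0 - pi) for pi in prob]
        old = [fj[:] for fj in f]
        alpha, f = weighted_backfit(cols, z, w, f)
        change = sum((a - b) ** 2 for fo, fn in zip(old, f) for a, b in zip(fo, fn))
        if change < tol:                             # step 3
            break
    return alpha, f, probabilities(alpha, f)[1]


# Simulated binary data with an additive logit (hypothetical example).
random.seed(1)
X = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(300)]
y = [1 if random.random() < 1 / (1 + math.exp(-(1.5 * math.sin(2 * a) + b))) else 0
     for a, b in X]
alpha, f, p_hat = local_scoring(X, y)
mean_p1 = sum(p for p, yi in zip(p_hat, y) if yi == 1) / sum(y)
mean_p0 = sum(p for p, yi in zip(p_hat, y) if yi == 0) / (len(y) - sum(y))
```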


The additive model fitting in step (2) of Algorithm 9.2 requires a weighted 
scatterplot smoother. Most smoothing procedures can accept observation 
weights (Exercise 5.12); see Chapter 3 of Hastie and Tibshirani (1990) for 
further details. 

The additive logistic regression model can be generalized further to handle
more than two classes, using the multilogit formulation as outlined in
Section 4.4. While the formulation is a straightforward extension of (9.8),
the algorithms for fitting such models are more complex. See Yee and Wild
(1996) for details, and the VGAM software currently available from:

http://www.stat.auckland.ac.nz/~yee

Example: Predicting Email Spam 

We apply a generalized additive model to the spam data introduced in 
Chapter 1. The data consists of information from 4601 email messages, in 
a study to screen email for “spam” (i.e., junk email). The data is publicly 
available at ftp.ics.uci.edu, and was donated by George Forman from 
Hewlett-Packard laboratories, Palo Alto, California. 

The response variable is binary, with values email or spam, and there are 
57 predictors as described below: 

• 48 quantitative predictors: the percentage of words in the email that
match a given word. Examples include business, address, internet,
free, and george. The idea was that these could be customized for
individual users.

• 6 quantitative predictors: the percentage of characters in the email
that match a given character. The characters are ch;, ch(, ch[, ch!,
ch$, and ch#.

• The average length of uninterrupted sequences of capital letters:
CAPAVE.

• The length of the longest uninterrupted sequence of capital letters:
CAPMAX.

• The sum of the length of uninterrupted sequences of capital letters:
CAPTOT.

TABLE 9.1. Test data confusion matrix for the additive logistic regression model
fit to the spam training data. The overall test error rate is 5.5%.

                    Predicted Class
True Class      email (0)    spam (1)
email (0)         58.3%        2.5%
spam (1)           3.0%       36.3%

We coded spam as 1 and email as zero. A test set of size 1536 was randomly 
chosen, leaving 3065 observations in the training set. A generalized additive 
model was fit, using a cubic smoothing spline with a nominal four degrees of 
freedom for each predictor. What this means is that for each predictor $X_j$,
the smoothing-spline parameter $\lambda_j$ was chosen so that $\mathrm{trace}[\mathbf{S}_j(\lambda_j)] - 1 = 4$,
where $\mathbf{S}_j(\lambda)$ is the smoothing spline operator matrix constructed using the
observed values $x_{ij}$, $i = 1, \ldots, N$. This is a convenient way of specifying
the amount of smoothing in such a complex model. 

Most of the spam predictors have a very long-tailed distribution. Before 
fitting the GAM model, we log-transformed each variable (actually $\log(x + 0.1)$),
but the plots in Figure 9.1 are shown as a function of the original
variables. 

The test error rates are shown in Table 9.1; the overall error rate is 5.5%.
By comparison, a linear logistic regression has a test error rate of 7.6%. 
Table 9.2 shows the predictors that are highly significant in the additive 
model. 

For ease of interpretation, in Table 9.2 the contribution for each variable
is decomposed into a linear component and the remaining nonlinear component.
The top block of predictors are positively correlated with spam,
while the bottom block is negatively correlated. The linear component is a
weighted least squares linear fit of the fitted curve on the predictor, while
the nonlinear part is the residual. The linear component of an estimated
function is summarized by the coefficient, standard error and Z-score; the
latter is the coefficient divided by its standard error, and is considered
significant if it exceeds the appropriate quantile of a standard normal distribution.
The column labeled nonlinear P-value is a test of nonlinearity
of the estimated function. Note, however, that the effect of each predictor
is fully adjusted for the entire effects of the other predictors, not just for
their linear parts. The predictors shown in the table were judged significant
by at least one of the tests (linear or nonlinear) at the p = 0.01 level
(two-sided).

TABLE 9.2. Significant predictors from the additive model fit to the spam training
data. The coefficients represent the linear part of $\hat f_j$, along with their standard
errors and Z-score. The nonlinear P-value is for a test of nonlinearity of $\hat f_j$.

Name        Num.    df    Coefficient   Std. Error   Z Score   Nonlinear P-value

Positive effects
our           5     3.9       0.566        0.114       4.970        0.052
over          6     3.9       0.244        0.195       1.249        0.004
remove        7     4.0       0.949        0.183       5.201        0.093
internet      8     4.0       0.524        0.176       2.974        0.028
free         16     3.9       0.507        0.127       4.010        0.065
business     17     3.8       0.779        0.186       4.179        0.194
hpl          26     3.8       0.045        0.250       0.181        0.002
ch!          52     4.0       0.674        0.128       5.283        0.164
ch$          53     3.9       1.419        0.280       5.062        0.354
CAPMAX       56     3.8       0.247        0.228       1.080        0.000
CAPTOT       57     4.0       0.755        0.165       4.566        0.063

Negative effects
hp           25     3.9      -1.404        0.224      -6.262        0.140
george       27     3.7      -5.003        0.744      -6.722        0.045
1999         37     3.8      -0.672        0.191      -3.512        0.011
re           45     3.9      -0.620        0.133      -4.649        0.597
edu          46     4.0      -1.183        0.209      -5.647        0.000

Figure 9.1 shows the estimated functions for the significant predictors 
appearing in Table 9.2. Many of the nonlinear effects appear to account for 
a strong discontinuity at zero. For example, the probability of spam drops 
significantly as the frequency of george increases from zero, but then does 
not change much after that. This suggests that one might replace each of 
the frequency predictors by an indicator variable for a zero count, and resort 
to a linear logistic model. This gave a test error rate of 7.4%; including the 
linear effects of the frequencies as well dropped the test error to 6.6%. It 
appears that the nonlinearities in the additive model have an additional 
predictive power. 






















FIGURE 9.1. Spam analysis: estimated functions for significant predictors. The
rug plot along the bottom of each frame indicates the observed values of the corresponding
predictor. For many of the predictors the nonlinearity picks up the
discontinuity at zero.






































































It is more serious to classify a genuine email message as spam, since then 
a good email would be filtered out and would not reach the user. We can 
alter the balance between the class error rates by changing the losses (see 
Section 2.4). If we assign a loss $L_{01}$ for predicting a true class 0 as class 1,
and $L_{10}$ for predicting a true class 1 as class 0, then the estimated Bayes
rule predicts class 1 if its probability is greater than $L_{01}/(L_{01} + L_{10})$. For
example, if we take $L_{01} = 10$, $L_{10} = 1$ then the (true) class 0 and class 1
error rates change to 0.8% and 8.7%. 
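The effect of unequal losses on the decision threshold is easy to compute directly. The sketch below evaluates the threshold $L_{01}/(L_{01} + L_{10})$ from the text; the function names `bayes_threshold` and `classify` are hypothetical, and the probabilities passed in are made-up values rather than output from the spam model.

```python
def bayes_threshold(loss_01, loss_10):
    """Probability above which class 1 is predicted.
    loss_01: loss for predicting a true class 0 as class 1.
    loss_10: loss for predicting a true class 1 as class 0."""
    return loss_01 / (loss_01 + loss_10)


def classify(prob_class1, loss_01=1.0, loss_10=1.0):
    """Estimated Bayes rule for two classes under the given losses."""
    return 1 if prob_class1 > bayes_threshold(loss_01, loss_10) else 0


equal_loss_threshold = bayes_threshold(1.0, 1.0)   # 0.5
spam_threshold = bayes_threshold(10.0, 1.0)        # 10/11, roughly 0.909
```

With $L_{01} = 10$ and $L_{10} = 1$, a message is labeled spam only when its estimated spam probability exceeds 10/11, so a message with probability 0.8 is kept as email.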

More ambitiously, we can encourage the model to fit the data in
class 0 better by using weights $L_{01}$ for the class 0 observations and $L_{10}$ for the
class 1 observations. As above, we then use the estimated Bayes rule to
predict. This gave error rates of 1.2% and 8.0% in (true) class 0 and class 1, 
respectively. We discuss below the issue of unequal losses further, in the 
context of tree-based models. 

After fitting an additive model, one should check whether the inclusion 
of some interactions can significantly improve the fit. This can be done 
“manually,” by inserting products of some or all of the significant inputs, 
or automatically via the MARS procedure (Section 9.4). 

This example uses the additive model in an automatic fashion. As a data 
analysis tool, additive models are often used in a more interactive fashion, 
adding and dropping terms to determine their effect. By calibrating the 
amount of smoothing in terms of $\mathrm{df}_j$, one can move seamlessly between
linear models ($\mathrm{df}_j = 1$) and partially linear models, where some terms are
modeled more flexibly. See Hastie and Tibshirani (1990) for more details. 


9.1.3 Summary 

Additive models provide a useful extension of linear models, making them 
more flexible while still retaining much of their interpretability. The familiar 
tools for modeling and inference in linear models are also available for 
additive models, seen for example in Table 9.2. The backfitting procedure 
for fitting these models is simple and modular, allowing one to choose a 
fitting method appropriate for each input variable. As a result they have 
become widely used in the statistical community. 

However, additive models can have limitations for large data-mining applications.
The backfitting algorithm fits all predictors, which is not feasible
or desirable when a large number are available. The BRUTO procedure
(Hastie and Tibshirani, 1990, Chapter 9) combines backfitting with selection
of inputs, but is not designed for large data-mining problems. There
has also been recent work using lasso-type penalties to estimate sparse additive
models, for example the COSSO procedure of Lin and Zhang (2006)
and the SpAM proposal of Ravikumar et al. (2008). For large problems a
forward stagewise approach such as boosting (Chapter 10) is more effective, 
and also allows for interactions to be included in the model. 
9.2 Tree-Based Methods

9.2.1 Background 

Tree-based methods partition the feature space into a set of rectangles, and 
then fit a simple model (like a constant) in each one. They are conceptually 
simple yet powerful. We first describe a popular method for tree-based 
regression and classification called CART, and later contrast it with C4.5, 
a major competitor. 

Let’s consider a regression problem with continuous response $Y$ and inputs
$X_1$ and $X_2$, each taking values in the unit interval. The top left panel
of Figure 9.2 shows a partition of the feature space by lines that are parallel
to the coordinate axes. In each partition element we can model $Y$ with a
different constant. However, there is a problem: although each partitioning
line has a simple description like $X_1 = c$, some of the resulting regions are
complicated to describe.

To simplify matters, we restrict attention to recursive binary partitions 
like that in the top right panel of Figure 9.2. We first split the space into 
two regions, and model the response by the mean of Y in each region. 
We choose the variable and split-point to achieve the best fit. Then one 
or both of these regions are split into two more regions, and this process 
is continued, until some stopping rule is applied. For example, in the top 
right panel of Figure 9.2, we first split at $X_1 = t_1$. Then the region $X_1 \le t_1$
is split at $X_2 = t_2$ and the region $X_1 > t_1$ is split at $X_1 = t_3$. Finally, the
region $X_1 > t_3$ is split at $X_2 = t_4$. The result of this process is a partition
into the five regions $R_1, R_2, \ldots, R_5$ shown in the figure. The corresponding
regression model predicts $Y$ with a constant $c_m$ in region $R_m$, that is,

$$\hat f(X) = \sum_{m=1}^{5} c_m I\{(X_1, X_2) \in R_m\}. \qquad (9.9)$$

This same model can be represented by the binary tree in the bottom left 
panel of Figure 9.2. The full dataset sits at the top of the tree. Observations 
satisfying the condition at each junction are assigned to the left branch, 
and the others to the right branch. The terminal nodes or leaves of the 
tree correspond to the regions $R_1, R_2, \ldots, R_5$. The bottom right panel of
Figure 9.2 is a perspective plot of the regression surface from this model.
For illustration, we chose the node means $c_1 = -5$, $c_2 = -7$, $c_3 = 0$, $c_4 = 2$,
$c_5 = 4$ to make this plot.
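The five-region model can be written as a short chain of comparisons, which is exactly what the binary tree encodes. In the sketch below the node means are the ones quoted in the text, but the split points `t1` to `t4` are made-up values and the assignment of labels $R_1, \ldots, R_5$ to the branches is an assumption, since the figure itself is not reproduced here.

```python
def tree_predict(x1, x2, t=(0.4, 0.5, 0.7, 0.6), c=(-5.0, -7.0, 0.0, 2.0, 4.0)):
    """Piecewise-constant prediction from a recursive binary partition.
    t = (t1, t2, t3, t4): hypothetical split points in the unit interval.
    c = (c1, ..., c5): node means for regions R1, ..., R5 (assumed labeling)."""
    t1, t2, t3, t4 = t
    c1, c2, c3, c4, c5 = c
    if x1 <= t1:                      # first split at X1 = t1
        return c1 if x2 <= t2 else c2  # then split X1 <= t1 at X2 = t2
    if x1 <= t3:                      # split X1 > t1 at X1 = t3
        return c3
    return c4 if x2 <= t4 else c5      # split X1 > t3 at X2 = t4


predictions = [tree_predict(0.1, 0.1), tree_predict(0.1, 0.9),
               tree_predict(0.5, 0.2), tree_predict(0.9, 0.1),
               tree_predict(0.9, 0.9)]
```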

A key advantage of the recursive binary tree is its interpretability. The 
feature space partition is fully described by a single tree. With more than 
two inputs, partitions like that in the top right panel of Figure 9.2 are 
difficult to draw, but the binary tree representation works in the same 
way. This representation is also popular among medical scientists, perhaps 
because it mimics the way that a doctor thinks. The tree stratifies the
population into strata of high and low outcome, on the basis of patient
characteristics.

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a
two-dimensional feature space by recursive binary splitting, as used in CART,
applied to some fake data. Top left panel shows a general partition that cannot
be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding
to the partition in the top right panel, and a perspective plot of the
prediction surface appears in the bottom right panel.

9.2.2 Regression Trees 

We now turn to the question of how to grow a regression tree. Our data 
consists of $p$ inputs and a response, for each of $N$ observations: that is,
$(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The algorithm
needs to automatically decide on the splitting variables and split points,
and also what topology (shape) the tree should have. Suppose first that we
have a partition into $M$ regions $R_1, R_2, \ldots, R_M$, and we model the response
as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m). \qquad (9.10)$$

If we adopt as our criterion minimization of the sum of squares
$\sum (y_i - f(x_i))^2$, it is easy to see that the best $\hat c_m$ is just the average of $y_i$ in region
$R_m$:

$$\hat c_m = \mathrm{ave}(y_i | x_i \in R_m). \qquad (9.11)$$


Now finding the best binary partition in terms of minimum sum of squares 
is generally computationally infeasible. Hence we proceed with a greedy 
algorithm. Starting with all of the data, consider a splitting variable j and 
split point s, and define the pair of half-planes 

$$R_1(j, s) = \{X | X_j \le s\} \text{ and } R_2(j, s) = \{X | X_j > s\}. \qquad (9.12)$$

Then we seek the splitting variable $j$ and split point $s$ that solve

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]. \qquad (9.13)$$

For any choice $j$ and $s$, the inner minimization is solved by

$$\hat c_1 = \mathrm{ave}(y_i | x_i \in R_1(j,s)) \text{ and } \hat c_2 = \mathrm{ave}(y_i | x_i \in R_2(j,s)). \qquad (9.14)$$

For each splitting variable, the determination of the split point $s$ can
be done very quickly and hence by scanning through all of the inputs,
determination of the best pair $(j, s)$ is feasible.
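The greedy search for a single split can be sketched as a scan over sorted values, using running sums so that each candidate split point is evaluated in constant time. This is one plausible way to carry out the scan, not necessarily how CART implementations do it; `best_split` and the toy data are hypothetical.

```python
def best_split(X, y):
    """Return (j, s, sse) minimizing the two-region sum of squares,
    with regions {X_j <= s} and {X_j > s}; s is an observed value."""
    n, p = len(y), len(X[0])
    total, total_sq = sum(y), sum(v * v for v in y)
    best = None
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = left_sq = 0.0
        for pos in range(n - 1):
            i = order[pos]
            left_sum += y[i]
            left_sq += y[i] ** 2
            if X[i][j] == X[order[pos + 1]][j]:
                continue              # cannot split between tied values
            n_left, n_right = pos + 1, n - pos - 1
            # Sum of squares about the mean in each region: sum(y^2) - (sum y)^2 / N.
            sse = (left_sq - left_sum ** 2 / n_left) + \
                  ((total_sq - left_sq) - (total - left_sum) ** 2 / n_right)
            if best is None or sse < best[2]:
                best = (j, X[i][j], sse)
    return best


# Toy data: the response steps from 1 to 5 as the first input crosses 0.3.
X = [[0.1, 0.5], [0.2, 0.1], [0.3, 0.9], [0.7, 0.3], [0.8, 0.8], [0.9, 0.2]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
j_best, s_best, sse_best = best_split(X, y)
```

Growing a tree then amounts to applying the same search recursively to the observations falling in each of the two resulting regions.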

Having found the best split, we partition the data into the two resulting 
regions and repeat the splitting process on each of the two regions. Then 
this process is repeated on all of the resulting regions. 

How large should we grow the tree? Clearly a very large tree might overfit 
the data, while a small tree might not capture the important structure. 
Tree size is a tuning parameter governing the model’s complexity, and the
optimal tree size should be adaptively chosen from the data. One approach 
would be to split tree nodes only if the decrease in sum-of-squares due to the 
split exceeds some threshold. This strategy is too short-sighted, however, 
since a seemingly worthless split might lead to a very good split below it. 

The preferred strategy is to grow a large tree $T_0$, stopping the splitting
process only when some minimum node size (say 5) is reached. Then this
large tree is pruned using cost-complexity pruning, which we now describe.

We define a subtree $T \subset T_0$ to be any tree that can be obtained by
pruning $T_0$, that is, collapsing any number of its internal (non-terminal)
nodes. We index terminal nodes by $m$, with node $m$ representing region
$R_m$. Let $|T|$ denote the number of terminal nodes in $T$. Letting

$$N_m = \#\{x_i \in R_m\},$$

$$\hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i,$$

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2, \qquad (9.15)$$

we define the cost complexity criterion

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|. \qquad (9.16)$$

The idea is to find, for each $\alpha$, the subtree $T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$.
The tuning parameter $\alpha \ge 0$ governs the tradeoff between tree size and its
goodness of fit to the data. Large values of $\alpha$ result in smaller trees $T_\alpha$, and
conversely for smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$ the
solution is the full tree $T_0$. We discuss how to adaptively choose $\alpha$ below.

For each $\alpha$ one can show that there is a unique smallest subtree $T_\alpha$ that
minimizes $C_\alpha(T)$. To find $T_\alpha$ we use weakest link pruning: we successively
collapse the internal node that produces the smallest per-node increase in
$\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root) tree.
This gives a (finite) sequence of subtrees, and one can show this sequence
must contain $T_\alpha$. See Breiman et al. (1984) or Ripley (1996) for details.
Estimation of $\alpha$ is achieved by five- or tenfold cross-validation: we choose
the value $\hat\alpha$ to minimize the cross-validated sum of squares. Our final tree
is $T_{\hat\alpha}$.
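Weakest link pruning can be sketched on a small hand-built tree. The representation is an assumption made for illustration: each node is a dictionary holding `sse`, the value of $N_m Q_m$ that the node would contribute if it were a terminal node, plus `left` and `right` children (both `None` for a leaf). The per-node increase for an internal node is taken to be the increase in total sum of squares from collapsing it, divided by the number of terminal nodes removed.

```python
def leaves_and_cost(node):
    """Number of terminal nodes below `node` and their total sum of squares."""
    if node["left"] is None:
        return 1, node["sse"]
    n_left, cost_left = leaves_and_cost(node["left"])
    n_right, cost_right = leaves_and_cost(node["right"])
    return n_left + n_right, cost_left + cost_right


def weakest_link(node):
    """Internal node with the smallest per-node increase, as (increase, node)."""
    if node["left"] is None:
        return None
    n_leaves, cost = leaves_and_cost(node)
    best = ((node["sse"] - cost) / (n_leaves - 1), node)
    for child in (node["left"], node["right"]):
        candidate = weakest_link(child)
        if candidate is not None and candidate[0] < best[0]:
            best = candidate
    return best


def prune_sequence(root):
    """Collapse weakest links until only the root remains.
    Returns a list of (per-node increase, number of leaves, total cost)."""
    n_leaves, cost = leaves_and_cost(root)
    sequence = [(0.0, n_leaves, cost)]
    while root["left"] is not None:
        increase, node = weakest_link(root)
        node["left"] = node["right"] = None     # collapse to a terminal node
        n_leaves, cost = leaves_and_cost(root)
        sequence.append((increase, n_leaves, cost))
    return sequence


def leaf(sse):
    return {"sse": sse, "left": None, "right": None}


# Hypothetical three-leaf tree: the left split barely reduces the error.
root = {"sse": 100, "left": {"sse": 30, "left": leaf(10), "right": leaf(18)},
        "right": leaf(20)}
sequence = prune_sequence(root)
```

In this toy tree the left split reduces the sum of squares by only 2, so it is collapsed first; the root split, worth 50, goes last. Cross-validation would then be used to pick among the subtrees in the sequence.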


9.2.3 Classification Trees 

If the target is a classification outcome taking values $1, 2, \ldots, K$, the only
changes needed in the tree algorithm pertain to the criteria for splitting
nodes and pruning the tree. For regression we used the squared-error node
impurity measure $Q_m(T)$ defined in (9.15), but this is not suitable for
classification. In a node $m$, representing a region $R_m$ with $N_m$ observations,
let

$$\hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),$$

the proportion of class $k$ observations in node $m$. We classify the observations
in node $m$ to class $k(m) = \arg\max_k \hat p_{mk}$, the majority class in
node $m$. Different measures $Q_m(T)$ of node impurity include the following:

Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat p_{mk(m)}$.

Gini index: $\sum_{k \neq k'} \hat p_{mk} \hat p_{mk'} = \sum_{k=1}^{K} \hat p_{mk}(1 - \hat p_{mk})$.

Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat p_{mk} \log \hat p_{mk}$.

(9.17)

FIGURE 9.3. Node impurity measures for two-class classification, as a function
of the proportion $p$ in class 2. Cross-entropy has been scaled to pass through
(0.5, 0.5).

For two classes, if $p$ is the proportion in the second class, these three measures
are $1 - \max(p, 1 - p)$, $2p(1 - p)$ and $-p \log p - (1 - p) \log(1 - p)$,
respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy
and the Gini index are differentiable, and hence more amenable to
numerical optimization. Comparing (9.13) and (9.15), we see that we need
to weight the node impurity measures by the number $N_{m_L}$ and $N_{m_R}$ of
observations in the two child nodes created by splitting node $m$.

In addition, cross-entropy and the Gini index are more sensitive to changes 
in the node probabilities than the misclassification rate. For example, in 
a two-class problem with 400 observations in each class (denote this by 
(400,400)), suppose one split created nodes (300,100) and (100,300), while
the other created nodes (200,400) and (200,0). Both splits produce a misclassification
rate of 0.25, but the second split produces a pure node and is
probably preferable. Both the Gini index and cross-entropy are lower for the 
second split. For this reason, either the Gini index or cross-entropy should 
be used when growing the tree. To guide cost-complexity pruning, any of 
the three measures can be used, but typically it is the misclassification rate. 
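The (400,400) example can be verified numerically. The sketch below computes the three two-class impurity measures, weights them by child node size, and compares the two candidate splits; the function names are hypothetical, and each child is written as a pair of class counts.

```python
import math


def misclassification(p):
    return 1.0 - max(p, 1.0 - p)


def gini(p):
    return 2.0 * p * (1.0 - p)


def cross_entropy(p):
    # Terms with zero proportion contribute nothing (0 log 0 is taken as 0).
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def split_impurity(children, measure):
    """Impurity of a split, weighting each child node by its size.
    children: list of (count in class 1, count in class 2) pairs."""
    total = sum(a + b for a, b in children)
    return sum((a + b) / total * measure(b / (a + b)) for a, b in children)


split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]     # second child is a pure node

results = {name: (split_impurity(split_a, m), split_impurity(split_b, m))
           for name, m in [("misclassification", misclassification),
                           ("gini", gini),
                           ("cross_entropy", cross_entropy)]}
```

Both splits have a weighted misclassification rate of 0.25, while the Gini index falls from 0.375 for the first split to 1/3 for the second, and cross-entropy likewise prefers the second split.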

The Gini index can be interpreted in two interesting ways. Rather than 
classify observations to the majority class in the node, we could classify 
them to class $k$ with probability $\hat p_{mk}$. Then the training error rate of this
rule in the node is $\sum_{k \neq k'} \hat p_{mk} \hat p_{mk'}$, the Gini index. Similarly, if we code
each observation as 1 for class $k$ and zero otherwise, the variance over the
node of this 0-1 response is $\hat p_{mk}(1 - \hat p_{mk})$. Summing over classes $k$ again
gives the Gini index. 

9.2.4 Other Issues 

Categorical Predictors 

When splitting a predictor having $q$ possible unordered values, there are
$2^{q-1} - 1$ possible partitions of the $q$ values into two groups, and the computations
become prohibitive for large $q$. However, with a 0-1 outcome,
this computation simplifies. We order the predictor classes according to the
proportion falling in outcome class 1. Then we split this predictor as if it
were an ordered predictor. One can show this gives the optimal split, in
terms of cross-entropy or Gini index, among all possible $2^{q-1} - 1$ splits. This
result also holds for a quantitative outcome and square error loss: the categories
are ordered by increasing mean of the outcome. Although intuitive,
the proofs of these assertions are not trivial. The proof for binary outcomes
is given in Breiman et al. (1984) and Ripley (1996); the proof for quantitative
outcomes can be found in Fisher (1958). For multicategory outcomes,
no such simplifications are possible, although various approximations have 
been proposed (Loh and Vanichsetakul, 1988). 
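The ordering shortcut for a 0-1 outcome can be checked against an exhaustive search on a small example. In this sketch the category counts are made up, `counts` maps each category to its (class 0, class 1) counts, and the function names are hypothetical. The ordered search examines only $q - 1$ splits, while the exhaustive search examines all $2^{q-1} - 1$.

```python
from itertools import combinations


def weighted_gini(counts, left):
    """Size-weighted Gini index of the split `left` versus the remaining categories."""
    right = [c for c in counts if c not in left]
    total = sum(n0 + n1 for n0, n1 in counts.values())
    impurity = 0.0
    for group in (left, right):
        n0 = sum(counts[c][0] for c in group)
        n1 = sum(counts[c][1] for c in group)
        p = n1 / (n0 + n1)
        impurity += (n0 + n1) / total * 2.0 * p * (1.0 - p)
    return impurity


def best_split_ordered(counts):
    """Order categories by proportion in class 1, then split as if ordered."""
    order = sorted(counts, key=lambda c: counts[c][1] / sum(counts[c]))
    splits = [order[:k] for k in range(1, len(order))]
    return min(splits, key=lambda left: weighted_gini(counts, left))


def best_split_exhaustive(counts):
    """Try every partition into two non-empty groups (first category kept on the left)."""
    first, *rest = sorted(counts)
    splits = [[first, *extra]
              for k in range(len(rest)) for extra in combinations(rest, k)]
    return min(splits, key=lambda left: weighted_gini(counts, left)), len(splits)


counts = {"a": (30, 10), "b": (5, 25), "c": (20, 20), "d": (8, 2), "e": (3, 17)}
ordered = best_split_ordered(counts)
exhaustive, n_candidates = best_split_exhaustive(counts)
gini_ordered = weighted_gini(counts, ordered)
gini_exhaustive = weighted_gini(counts, exhaustive)
```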

The partitioning algorithm tends to favor categorical predictors with 
many levels $q$; the number of partitions grows exponentially in $q$, and the
more choices we have, the more likely we can find a good one for the data 
at hand. This can lead to severe overfitting if q is large, and such variables 
should be avoided. 

The Loss Matrix 

In classification problems, the consequences of misclassifying observations 
are more serious in some classes than others. For example, it is probably 
worse to predict that a person will not have a heart attack when he/she 
actually will, than vice versa. To account for this, we define a $K \times K$ loss
matrix $\mathbf{L}$, with $L_{kk'}$ being the loss incurred for classifying a class $k$ observation
as class $k'$. Typically no loss is incurred for correct classifications,
that is, $L_{kk} = 0\ \forall k$. To incorporate the losses into the modeling process,
we could modify the Gini index to $\sum_{k \neq k'} L_{kk'} \hat p_{mk} \hat p_{mk'}$; this would be the
expected loss incurred by the randomized rule. This works for the multiclass
case, but in the two-class case has no effect, since the coefficient of
$\hat p_{mk} \hat p_{mk'}$ is $L_{kk'} + L_{k'k}$. For two classes a better approach is to weight the
observations in class $k$ by $L_{kk'}$. This can be used in the multiclass case only
if, as a function of $k$, $L_{kk'}$ doesn’t depend on $k'$. Observation weighting can
be used with the deviance as well. The effect of observation weighting is to
alter the prior probability on the classes. In a terminal node, the empirical
Bayes rule implies that we classify to class $k(m) = \arg\min_k \sum_\ell L_{\ell k} \hat p_{m\ell}$.
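The terminal-node rule can be illustrated with a two-class node. In the sketch below `L[l][k]` is the loss for classifying a true class `l` observation as class `k`; the node proportions and loss values are made up, and `classify_node` is a hypothetical name.

```python
def classify_node(p, L):
    """Class minimizing the expected loss sum_l L[l][k] * p[l] in a node.
    p: class proportions in the node; L: K x K loss matrix."""
    K = len(p)
    expected = [sum(L[l][k] * p[l] for l in range(K)) for k in range(K)]
    return min(range(K), key=expected.__getitem__), expected


p = [0.7, 0.3]                     # class 0 is the majority in this node

zero_one = [[0, 1], [1, 0]]
majority_class, _ = classify_node(p, zero_one)

costly_miss = [[0, 1], [5, 0]]     # missing a true class 1 costs five times more
chosen_class, expected_loss = classify_node(p, costly_miss)
```

With equal losses the node is assigned to the majority class 0, but once missing a class 1 case costs five times as much, the expected losses become 1.5 for predicting class 0 and 0.7 for predicting class 1, so the node is assigned to class 1.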

Missing Predictor Values 

Suppose our data has some missing predictor values in some or all of the 
variables. We might discard any observation with some missing values, but 
this could lead to serious depletion of the training set. Alternatively we 
might try to fill in (impute) the missing values, with say the mean of that 
predictor over the nonmissing observations. For tree-based models, there 
are two better approaches. The first is applicable to categorical predictors: 
we simply make a new category for “missing.” From this we might discover
that observations with missing values for some measurement behave
differently than those with nonmissing values. The second more general 
approach is the construction of surrogate variables. When considering a 
predictor for a split, we use only the observations for which that predictor 
is not missing. Having chosen the best (primary) predictor and split point, 
we form a list of surrogate predictors and split points. The first surrogate 
is the predictor and corresponding split point that best mimics the split of 
the training data achieved by the primary split. The second surrogate is 
the predictor and corresponding split point that does second best, and so 
on. When sending observations down the tree either in the training phase 
or during prediction, we use the surrogate splits in order, if the primary 
splitting predictor is missing. Surrogate splits exploit correlations between 
predictors to try and alleviate the effect of missing data. The higher the correlation
between the missing predictor and the other predictors, the smaller
the loss of information due to the missing value. The general problem of 
missing data is discussed in Section 9.6. 

Why Binary Splits? 

Rather than splitting each node into just two groups at each stage (as 
above), we might consider multiway splits into more than two groups. While 
this can sometimes be useful, it is not a good general strategy. The problem 
is that multiway splits fragment the data too quickly, leaving insufficient 
data at the next level down. Hence we would want to use such splits only 
when needed. Since multiway splits can be achieved by a series of binary 
splits, the latter are preferred. 
Other Tree-Building Procedures

The discussion above focuses on the CART (classification and regression 
tree) implementation of trees. The other popular methodology is ID3 and 
its later versions, C4.5 and C5.0 (Quinlan, 1993). Early versions of the 
program were limited to categorical predictors, and used a top-down rule 
with no pruning. With more recent developments, C5.0 has become quite 
similar to CART. The most significant feature unique to C5.0 is a scheme 
for deriving rule sets. After a tree is grown, the splitting rules that define the 
terminal nodes can sometimes be simplified: that is, one or more conditions
can be dropped without changing the subset of observations that fall in 
the node. We end up with a simplified set of rules defining each terminal 
node; these no longer follow a tree structure, but their simplicity might 
make them more attractive to the user. 

Linear Combination Splits 

Rather than restricting splits to be of the form $X_j \le s$, one can allow splits
along linear combinations of the form $\sum a_j X_j \le s$. The weights $a_j$ and
split point $s$ are optimized to minimize the relevant criterion (such as the
Gini index). While this can improve the predictive power of the tree, it can 
hurt interpretability. Computationally, the discreteness of the split point 
search precludes the use of a smooth optimization for the weights. A better 
way to incorporate linear combination splits is in the hierarchical mixtures 
of experts (HME) model, the topic of Section 9.5. 

Instability of Trees 

One major problem with trees is their high variance. Often a small change 
in the data can result in a very different series of splits, making interpre¬ 
tation somewhat precarious. The major reason for this instability is the 
hierarchical nature of the process: the effect of an error in the top split 
is propagated down to all of the splits below it. One can alleviate this to 
some degree by trying to use a more stable split criterion, but the inherent 
instability is not removed. It is the price to be paid for estimating a simple, 
tree-based structure from the data. Bagging (Section 8.7) averages many 
trees to reduce this variance. 

Lack of Smoothness 

Another limitation of trees is the lack of smoothness of the prediction surface,
as can be seen in the bottom right panel of Figure 9.2. In classification
with 0/1 loss, this doesn’t hurt much, since bias in estimation of the class
probabilities has a limited effect. However, this can degrade performance
in the regression setting, where we would normally expect the underlying
function to be smooth. The MARS procedure, described in Section 9.4,
can be viewed as a modification of CART designed to alleviate this lack of
smoothness.

TABLE 9.3. Spam data: confusion rates for the 17-node tree (chosen by cross-validation)
on the test data. Overall error rate is 9.3%.

              Predicted
True        email     spam
email       57.3%     4.0%
spam         5.3%    33.4%

Difficulty in Capturing Additive Structure 

Another problem with trees is their difficulty in modeling additive structure.
In regression, suppose, for example, that $Y = c_1 I(X_1 < t_1) + c_2 I(X_2 < t_2) + \varepsilon$
where $\varepsilon$ is zero-mean noise. Then a binary tree might make its first
split on $X_1$ near $t_1$. At the next level down it would have to split both nodes
on $X_2$ at $t_2$ in order to capture the additive structure. This might happen
with sufficient data, but the model is given no special encouragement to find 
such structure. If there were ten rather than two additive effects, it would 
take many fortuitous splits to recreate the structure, and the data analyst 
would be hard pressed to recognize it in the estimated tree. The “blame” 
here can again be attributed to the binary tree structure, which has both 
advantages and drawbacks. Again the MARS method (Section 9.4) gives 
up this tree structure in order to capture additive structure. 


9.2.5 Spam Example (Continued) 

We applied the classification tree methodology to the spam example introduced
earlier. We used the deviance measure to grow the tree and misclassification
rate to prune it. Figure 9.4 shows the 10-fold cross-validation
error rate as a function of the size of the pruned tree, along with ±2 standard
errors of the mean, from the ten replications. The test error curve is
shown in orange. Note that the cross-validation error rates are indexed by
a sequence of values of $\alpha$ and not tree size; for trees grown in different folds,
a value of $\alpha$ might imply different sizes. The sizes shown at the base of the
plot refer to $|T_\alpha|$, the sizes of the pruned original tree.

The error flattens out at around 17 terminal nodes, giving the pruned tree 
in Figure 9.5. Of the 13 distinct features chosen by the tree, 11 overlap with 
the 16 significant features in the additive model (Table 9.2). The overall 
error rate shown in Table 9.3 is about 50% higher than for the additive 
model in Table 9.1. 

Consider the rightmost branches of the tree. We branch to the right
with a spam warning if more than 5.5% of the characters are the $ sign.
However, if in addition the phrase hp occurs frequently, then this is likely
to be company business and we classify as email. All of the 22 cases in
the test set satisfying these criteria were correctly classified. If the second
condition is not met, and in addition the average length of repeated capital
letters CAPAVE is larger than 2.9, then we classify as spam. Of the 227 test
cases, only seven were misclassified.

FIGURE 9.4. Results for spam example. The blue curve is the 10-fold
cross-validation estimate of misclassification rate as a function of tree
size, with standard error bars. The minimum occurs at a tree size with about
17 terminal nodes (using the “one-standard-error” rule). The orange curve is
the test error, which tracks the CV error quite closely. The cross-validation
is indexed by values of $\alpha$, shown above. The tree sizes shown below
refer to $|T_\alpha|$, the size of the original tree indexed by $\alpha$.

In medical classification problems, the terms sensitivity and specificity 
are used to characterize a rule. They are defined as follows: 


Sensitivity: probability of predicting disease given true state is disease. 


Specificity: probability of predicting non-disease given true state is
non-disease.

FIGURE 9.5. The pruned tree for the spam example. The split variables are
shown in blue on the branches, and the classification is shown in every node. The
numbers under the terminal nodes indicate misclassification rates on the test data.

FIGURE 9.6. ROC curves for the classification rules fit to the spam data. Curves
that are closer to the northeast corner represent better classifiers. In this case the
GAM classifier dominates the trees. The weighted tree achieves better sensitivity
for higher specificity than the unweighted tree. The numbers in the legend
represent the area under the curve.


If we think of spam and email as the presence and absence of disease,
respectively, then from Table 9.3 we have

$$\text{Sensitivity} = 100 \times \frac{33.4}{33.4 + 5.3} = 86.3\%,$$

$$\text{Specificity} = 100 \times \frac{57.3}{57.3 + 4.0} = 93.4\%.$$
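
As an illustration, the two ratios can be recomputed directly from the
cells of Table 9.3. This is only a sketch: the dictionary layout and the
function names are ours, and because the table entries are rounded
percentages, the results agree with the quoted figures only to within about
0.1 percentage points.

```python
# Cell percentages from Table 9.3, keyed by (true class, predicted class).
table_9_3 = {
    ("email", "email"): 57.3,
    ("email", "spam"): 4.0,
    ("spam", "email"): 5.3,
    ("spam", "spam"): 33.4,
}


def sensitivity(t):
    """Percentage of true spam (the 'disease') that is predicted as spam."""
    tp, fn = t[("spam", "spam")], t[("spam", "email")]
    return 100.0 * tp / (tp + fn)


def specificity(t):
    """Percentage of true email (the 'non-disease') predicted as email."""
    tn, fp = t[("email", "email")], t[("email", "spam")]
    return 100.0 * tn / (tn + fp)


def error_rate(t):
    """Overall misclassification percentage: the two off-diagonal cells."""
    return t[("email", "spam")] + t[("spam", "email")]


print(round(sensitivity(table_9_3), 1))
print(round(specificity(table_9_3), 1))
print(round(error_rate(table_9_3), 1))
```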

In this analysis we have used equal losses. As before let $L_{kk'}$ be the
loss associated with predicting a class $k$ object as class $k'$. By varying
the relative sizes of the losses $L_{01}$ and $L_{10}$, we increase the
sensitivity and decrease the specificity of the rule, or vice versa. In this
example, we want to avoid marking good email as spam, and thus we want the
specificity to be very high. We can achieve this by setting $L_{01} > 1$
say, with $L_{10} = 1$. The Bayes’ rule in each terminal node classifies to
class 1 (spam) if the proportion of spam is $\ge L_{01}/(L_{10} + L_{01})$,
and class zero otherwise. The receiver operating characteristic curve (ROC)
is a commonly used summary for assessing the tradeoff between sensitivity
and specificity. It is a plot of the sensitivity versus specificity as we
vary the parameters of a classification rule. Varying the loss $L_{01}$
between 0.1 and 10, and applying Bayes’ rule to the 17-node tree selected in
Figure 9.4, produced the ROC curve shown in Figure 9.6. The standard error
of each curve near 0.9 is approximately $\sqrt{0.9(1 - 0.9)/1536} = 0.008$,
and hence the standard error of the difference is about 0.01. We see that in
order to achieve a specificity of close to 100%, the sensitivity has to drop
to about 50%. The area under the curve is a commonly used quantitative
summary; extending the curve linearly in each direction so that it is
defined over [0, 100], the area is approximately 0.95. For comparison, we
have included the ROC curve for the GAM model fit to these data in
Section 9.2; it gives a better classification rule for any loss, with an
area of 0.98.
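
The node-level cutoff and the standard-error arithmetic in this paragraph
can be sketched as follows. The function names are illustrative, and the
last line assumes the two curves are independent, which is our reading of
how the "about 0.01" figure arises.

```python
import math


def spam_threshold(l01, l10=1.0):
    """Cutoff on the node's spam proportion: L01 / (L10 + L01)."""
    return l01 / (l10 + l01)


def classify_node(p_spam, l01, l10=1.0):
    """Return 1 (spam) if the spam proportion reaches the cutoff, else 0."""
    return 1 if p_spam >= spam_threshold(l01, l10) else 0


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n observations."""
    return math.sqrt(p * (1.0 - p) / n)


# Equal losses give the usual majority vote; L01 = 5 demands 5/6 spam.
print(spam_threshold(1.0), spam_threshold(5.0))

# Standard error near 0.9 with 1536 test cases, and for a difference of
# two such (assumed independent) curves.
se = binomial_se(0.9, 1536)
print(round(se, 3), round(math.sqrt(2.0) * se, 2))
```

Raising $L_{01}$ moves the cutoff toward 1, so fewer nodes are labeled
spam; this is what trades sensitivity for specificity along the ROC curve.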

Rather than just modifying the Bayes rule in the nodes, it is better to
take full account of the unequal losses in growing the tree, as was done
in Section 9.2. With just two classes 0 and 1, losses may be incorporated
into the tree-growing process by using weight $L_{k,1-k}$ for an observation
in class $k$. Here we chose $L_{01} = 5$, $L_{10} = 1$ and fit the same size
tree as before ($|T_\alpha| = 17$). This tree has higher sensitivity at high
values of the specificity than the original tree, but does more poorly at
the other extreme. Its top few splits are the same as the original tree, and
then it departs from it. For this application the tree grown using
$L_{01} = 5$ is clearly better than the original tree.

The area under the ROC curve, used above, is sometimes called the c- 
statistic. Interestingly, it can be shown that the area under the ROC curve 
is equivalent to the Mann-Whitney U statistic (or Wilcoxon rank-sum test), 
for the median difference between the prediction scores in the two groups 
(Hanley and McNeil, 1982). For evaluating the contribution of an additional 
predictor when added to a standard model, the c-statistic may not be an 
informative measure. The new predictor can be very significant in terms 
of the change in model deviance, but show only a small increase in the c- 
statistic. For example, removal of the highly significant term george from 
the model of Table 9.2 results in a decrease in the c-statistic of less than 
0.01. Instead, it is useful to examine how the additional predictor changes 
the classification on an individual sample basis. A good discussion of this 
point appears in Cook (2007). 
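
The link between the area under the ROC curve and the Mann-Whitney U
statistic can be checked numerically. The sketch below uses hypothetical
scores and our own function names. It computes the area once as the fraction
of (positive, negative) pairs that are ordered correctly, counting ties as
one half, and once from the rank sum of the positive group using midranks.

```python
def auc_pairwise(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties = 1/2."""
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc_rank_sum(pos, neg):
    """U / (n_pos * n_neg), where U = R_pos - n_pos (n_pos + 1) / 2."""
    ranks = midranks(list(pos) + list(neg))
    r_pos = sum(ranks[: len(pos)])
    u = r_pos - len(pos) * (len(pos) + 1) / 2.0
    return u / (len(pos) * len(neg))


# Hypothetical prediction scores for the two groups.
spam_scores = [0.9, 0.8, 0.4, 0.6]
email_scores = [0.1, 0.4, 0.35, 0.7]
print(auc_pairwise(spam_scores, email_scores))
print(auc_rank_sum(spam_scores, email_scores))
```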


9.3 PRIM: Bump Hunting 

Tree-based methods (for regression) partition the feature space into
box-shaped regions, to try to make the response averages in each box as
different as possible. The splitting rules defining the boxes are related to
each other through a binary tree, facilitating their interpretation.

The patient rule induction method (PRIM) also finds boxes in the feature 
space, but seeks boxes in which the response average is high. Hence it looks 
for maxima in the target function, an exercise known as bump hunting. (If 
minima rather than maxima are desired, one simply works with the negative 
response values.) 

PRIM also differs from tree-based partitioning methods in that the box 
definitions are not described by a binary tree. This makes interpretation of 
the collection of rules more difficult; however, by removing the binary tree 
constraint, the individual rules are often simpler. 

The main box construction method in PRIM works from the top down, 
starting with a box containing all of the data. The box is compressed along 
one face by a small amount, and the observations then falling outside the 
box are peeled off. The face chosen for compression is the one resulting in 
the largest box mean, after the compression is performed. Then the process 
is repeated, stopping when the current box contains some minimum number 
of data points. 

This process is illustrated in Figure 9.7. There are 200 data points
uniformly distributed over the unit square. The color-coded plot indicates
the response $Y$ taking the value 1 (red) when $0.5 < X_1 < 0.8$ and
$0.4 < X_2 < 0.6$, and zero (blue) otherwise. The panels show the successive
boxes found by the top-down peeling procedure, peeling off a proportion
$\alpha = 0.1$ of the remaining data points at each stage.

Figure 9.8 shows the mean of the response values in the box, as the box 
is compressed. 

After the top-down sequence is computed, PRIM reverses the process, 
expanding along any edge, if such an expansion increases the box mean. 
This is called pasting. Since the top-down procedure is greedy at each step, 
such an expansion is often possible. 

The result of these steps is a sequence of boxes, with different numbers
of observations in each box. Cross-validation, combined with the judgment
of the data analyst, is used to choose the optimal box size.

Denote by $B_1$ the indices of the observations in the box found in step 1.
The PRIM procedure then removes the observations in $B_1$ from the training
set, and the two-step process (top-down peeling, followed by bottom-up
pasting) is repeated on the remaining dataset. This entire process is
repeated several times, producing a sequence of boxes $B_1, B_2, \ldots, B_k$.
Each box is defined by a set of rules involving a subset of predictors like

$$(a_1 < X_1 < b_1) \ \text{and} \ (b_1 < X_3 < b_2).$$

A summary of the PRIM procedure is given in Algorithm 9.3.

PRIM can handle a categorical predictor by considering all partitions of
the predictor, as in CART. Missing values are also handled in a manner
similar to CART. PRIM is designed for regression (quantitative response
variable); a two-class outcome can be handled simply by coding it as 0 and
1. There is no simple way to deal with $k > 2$ classes simultaneously: one
approach is to run PRIM separately for each class versus a baseline class.

FIGURE 9.7. Illustration of PRIM algorithm. There are two classes, indicated
by the blue (class 0) and red (class 1) points. The procedure starts with a rectangle
(broken black lines) surrounding all of the data, and then peels away points along
one edge by a prespecified amount in order to maximize the mean of the points
remaining in the box. Starting at the top left panel, the sequence of peelings is
shown, until a pure red region is isolated in the bottom right panel. The iteration
number is indicated at the top of each panel.

FIGURE 9.8. Box mean as a function of number of observations in the box.

Algorithm 9.3 Patient Rule Induction Method.

1. Start with all of the training data, and a maximal box containing all
   of the data.

2. Consider shrinking the box by compressing one face, so as to peel off
   the proportion $\alpha$ of observations having either the highest values
   of a predictor $X_j$, or the lowest. Choose the peeling that produces the
   highest response mean in the remaining box. (Typically $\alpha = 0.05$ or
   0.10.)

3. Repeat step 2 until some minimal number of observations (say 10)
   remain in the box.

4. Expand the box along any face, as long as the resulting box mean
   increases.

5. Steps 1-4 give a sequence of boxes, with different numbers of
   observations in each box. Use cross-validation to choose a member of the
   sequence. Call the box $B_1$.

6. Remove the data in box $B_1$ from the dataset and repeat steps 2-5 to
   obtain a second box, and continue to get as many boxes as desired.
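
The peeling part of Algorithm 9.3 (steps 1 to 3) can be sketched in a few
lines. This is a minimal illustration, not a full PRIM implementation. It
omits pasting, cross-validation and the removal of boxes. It peels a whole
number of points, `max(1, int(alpha * n))`, by sorted order rather than by
quantile, and it breaks ties in favor of the first candidate. All of these
choices, and the function names, are our assumptions.

```python
def box_mean(y, idx):
    """Mean response over the observations whose indices are in idx."""
    return sum(y[i] for i in idx) / len(idx)


def peel_once(X, y, idx, n_peel):
    """Step 2: try peeling the n_peel lowest or highest points along each
    predictor, and keep the candidate box with the highest mean."""
    best_mean, best_idx = None, None
    for j in range(len(X[0])):
        order = sorted(idx, key=lambda i: X[i][j])
        for keep in (order[n_peel:], order[:-n_peel]):
            m = box_mean(y, keep)
            if best_mean is None or m > best_mean:
                best_mean, best_idx = m, keep
    return best_idx


def prim_peel(X, y, alpha=0.1, min_obs=10):
    """Steps 1-3: peel until another peel would leave fewer than min_obs."""
    idx = list(range(len(y)))
    trajectory = [(len(idx), box_mean(y, idx))]
    while True:
        n_peel = max(1, int(alpha * len(idx)))
        if len(idx) - n_peel < min_obs:
            break
        idx = peel_once(X, y, idx, n_peel)
        trajectory.append((len(idx), box_mean(y, idx)))
    return sorted(idx), trajectory


# Toy data: a single predictor with a bump (y = 1) on 40 <= x < 60.
X = [[float(i)] for i in range(100)]
y = [1.0 if 40 <= i < 60 else 0.0 for i in range(100)]
box, path = prim_peel(X, y)
print(box)
print(path[-1])
```

The `trajectory` list plays the role of Figure 9.8: it records the box mean
against the number of observations remaining after each peel.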

An advantage of PRIM over CART is its patience. Because of its binary
splits, CART fragments the data quite quickly. Assuming splits of equal
size, with $N$ observations it can only make $\log_2(N) - 1$ splits before
running out of data. If PRIM peels off a proportion $\alpha$ of training
points at each stage, it can perform approximately
$-\log(N)/\log(1 - \alpha)$ peeling steps before running out of data. For
example, if $N = 128$ and $\alpha = 0.10$, then $\log_2(N) - 1 = 6$ while
$-\log(N)/\log(1 - \alpha) \approx 46$. Taking into account that there must
be an integer number of observations at each stage, PRIM in fact can peel
only 29 times. In any case, the ability of PRIM to be more patient should
help the top-down greedy algorithm find a better solution.
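
The three counts in this example can be reproduced with a short script. The
first two functions evaluate the formulas as stated. The integer count needs
a rounding rule, which the text does not spell out: the sketch assumes that
each peel removes $\lceil \alpha n \rceil$ observations and that peeling
stops while at least one observation remains. Under that assumption the
count comes out to 29.

```python
import math
from fractions import Fraction


def cart_splits(n):
    """log2(N) - 1: levels of equal-size binary splits available."""
    return math.log2(n) - 1


def prim_peels_continuous(n, alpha):
    """-log(N) / log(1 - alpha): peels if fractional counts were allowed."""
    return -math.log(n) / math.log(1 - alpha)


def prim_peels_integer(n, alpha):
    """Count peels when ceil(alpha * n) whole observations are removed each
    time (an assumed rounding rule), keeping at least one observation.
    Requires 0 < alpha < 1; pass alpha as a Fraction for exact arithmetic."""
    steps = 0
    while True:
        n_peel = math.ceil(alpha * n)
        if n - n_peel < 1:
            break
        n -= n_peel
        steps += 1
    return steps


print(cart_splits(128))                                # 6.0
print(round(prim_peels_continuous(128, 0.10), 2))      # 46.05
print(prim_peels_integer(128, Fraction(1, 10)))        # 29
```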


9.3.1 Spam Example (Continued) 

We applied PRIM to the spam data, with the response coded as 1 for spam 
and 0 for email. 

The first two boxes found by PRIM are summarized below:

Rule 1       Global Mean   Box Mean   Box Support
Training       0.3931       0.9607      0.1413
Test           0.3958       1.0000      0.1536

Rule 1:   ch!     > 0.029
          CAPAVE  > 2.331
          your    > 0.705
          1999    < 0.040
          CAPTOT  > 79.50
          edu     < 0.070
          re      < 0.535
          ch;     < 0.030

Rule 2       Remain Mean   Box Mean   Box Support
Training       0.2998       0.9560      0.1043
Test           0.2862       0.9264      0.1061

Rule 2:   remove  > 0.010
          george  < 0.110


The box support is the proportion of observations falling in the box. 
The first box is purely spam, and contains about 15% of the test data. 
The second box contains 10.6% of the test observations, 92.6% of which 
are spam. Together the two boxes contain 26% of the data and are about 
97% spam. The next few boxes (not shown) are quite small, containing only 
about 3% of the data. 
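
The summary figures in this paragraph follow from the test-set rows of the
two tables, as the sketch below shows. It assumes that both box supports are
fractions of the full test set, which is how we read "10.6% of the test
observations", and the function names are ours.

```python
# Test-set rows from the two PRIM tables above.
global_mean = 0.3958
box1_mean, box1_support = 1.0000, 0.1536
box2_mean, box2_support = 0.9264, 0.1061


def remaining_mean(overall_mean, box_mean, box_support):
    """Mean response among observations left after removing one box."""
    return (overall_mean - box_mean * box_support) / (1.0 - box_support)


def combine(boxes):
    """Total support and support-weighted mean of (mean, support) pairs."""
    support = sum(s for _, s in boxes)
    mean = sum(m * s for m, s in boxes) / support
    return support, mean


# Removing box 1 should leave the "Remain Mean" reported for rule 2.
print(round(remaining_mean(global_mean, box1_mean, box1_support), 4))

support, purity = combine([(box1_mean, box1_support),
                           (box2_mean, box2_support)])
print(round(support, 2), round(purity, 2))
```

The first printed value agrees with the test "Remain Mean" of 0.2862 to
within rounding, and the second line gives the combined support of about
26% with about 97% spam.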

The predictors are listed in order of importance. Interestingly the top 
splitting variables in the CART tree (Figure 9.5) do not appear in PRIM’s 
first box. 


9.4 MARS: Multivariate Adaptive Regression Splines

MARS is an adaptive procedure for regression, and is well suited for
high-dimensional problems (i.e., a large number of inputs). It can be viewed
as a generalization of stepwise linear regression or a modification of the
CART method to improve the latter’s performance in the regression setting.
We introduce MARS from the first point of view, and later make the
connection to CART.

MARS uses expansions in piecewise linear basis functions of the form
$(x - t)_+$ and $(t - x)_+$. The “+” means positive part, so

$$
(x - t)_+ = \begin{cases} x - t, & \text{if } x > t,\\ 0, & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
(t - x)_+ = \begin{cases} t - x, & \text{if } x < t,\\ 0, & \text{otherwise.} \end{cases}
$$

As an example, the functions $(x - 0.5)_+$ and $(0.5 - x)_+$ are shown in
Figure 9.9.

FIGURE 9.9. The basis functions $(x - t)_+$ (solid orange) and $(t - x)_+$
(broken blue) used by MARS.

Each function is piecewise linear, with a knot at the value $t$. In the
terminology of Chapter 5, these are linear splines. We call the two functions
a reflected pair in the discussion below. The idea is to form reflected pairs
for each input $X_j$ with knots at each observed value $x_{ij}$ of that
input. Therefore, the collection of basis functions is

$$
\mathcal{C} = \{(X_j - t)_+,\ (t - X_j)_+\}_{\substack{t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\} \\ j = 1, 2, \ldots, p}}. \tag{9.18}
$$

If all of the input values are distinct, there are $2Np$ basis functions
altogether. Note that although each basis function depends only on a single
$X_j$, for example, $h(X) = (X_j - t)_+$, it is considered as a function
over the entire input space $\mathbb{R}^p$.
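
A reflected pair and the candidate set $\mathcal{C}$ are easy to write down
directly. The sketch below is illustrative: the names are ours, and each
knot is stored as a pair `(j, t)` from which both members of the reflected
pair can be evaluated on a full input vector.

```python
def hinge_pos(x, t):
    """(x - t)_+ : zero up to the knot t, then linear with slope +1."""
    return max(x - t, 0.0)


def hinge_neg(x, t):
    """(t - x)_+ : linear with slope -1 up to the knot t, then zero."""
    return max(t - x, 0.0)


def candidate_knots(X):
    """One (input index j, knot t) entry per distinct observed value of
    each input. Every entry stands for one reflected pair."""
    p = len(X[0])
    return [(j, t) for j in range(p) for t in sorted({row[j] for row in X})]


def evaluate_pair(x_row, j, t):
    """Both members of the reflected pair as functions of the whole input
    vector x_row; only coordinate j is actually used."""
    return hinge_pos(x_row[j], t), hinge_neg(x_row[j], t)


# Three observations on p = 2 inputs, all values distinct.
X = [[0.2, 1.5], [0.7, 0.3], [0.5, 0.9]]
knots = candidate_knots(X)
print(len(knots), 2 * len(knots))      # N*p knots, 2*N*p basis functions
print(evaluate_pair(X[1], 0, 0.5))     # (x - 0.5)_+ and (0.5 - x)_+ at 0.7
```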

The model-building strategy is like a forward stepwise linear regression,
but instead of using the original inputs, we are allowed to use functions
from the set $\mathcal{C}$ and their products. Thus the model has the form

$$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X), \tag{9.19}$$

where each $h_m(X)$ is a function in $\mathcal{C}$, or a product of two or
more such functions.

Given a choice for the $h_m$, the coefficients $\beta_m$ are estimated by
minimizing the residual sum-of-squares, that is, by standard linear
regression. The real art, however, is in the construction of the functions
$h_m(x)$. We start with only the constant function $h_0(X) = 1$ in our
model, and all functions in the set $\mathcal{C}$ are candidate functions.
This is depicted in Figure 9.10.

At each stage we consider as a new basis function pair all products of a
function $h_m$ in the model set $\mathcal{M}$ with one of the reflected
pairs in $\mathcal{C}$. We add to the model $\mathcal{M}$ the term of the
form

$$
\hat\beta_{M+1} h_\ell(X) \cdot (X_j - t)_+ + \hat\beta_{M+2} h_\ell(X) \cdot (t - X_j)_+, \quad h_\ell \in \mathcal{M},
$$

that produces the largest decrease in training error. Here
$\hat\beta_{M+1}$ and $\hat\beta_{M+2}$ are coefficients estimated by least
squares, along with all the other $M + 1$ coefficients in the model. Then
the winning products are added to the model and the process is continued
until the model set $\mathcal{M}$ contains some preset maximum number of
terms.

FIGURE 9.10. Schematic of the MARS forward model-building procedure. On
the left are the basis functions currently in the model: initially, this is the constant
function $h(X) = 1$. On the right are all candidate basis functions to be considered
in building the model. These are pairs of piecewise linear basis functions as in
Figure 9.9, with knots $t$ at all unique observed values $x_{ij}$ of each predictor $X_j$.
At each stage we consider all products of a candidate pair with a basis function
in the model. The product that decreases the residual error the most is added into
the current model. Above we illustrate the first three steps of the procedure, with
the selected functions shown in red.

FIGURE 9.11. The function $h(X_1, X_2) = (X_1 - x_{51})_+ \cdot (x_{72} - X_2)_+$,
resulting from multiplication of two piecewise linear MARS basis functions.

For example, at the first stage we consider adding to the model a function
of the form $\beta_1 (X_j - t)_+ + \beta_2 (t - X_j)_+$; $t \in \{x_{ij}\}$,
since multiplication by the constant function just produces the function
itself. Suppose the best choice is
$\hat\beta_1 (X_2 - x_{72})_+ + \hat\beta_2 (x_{72} - X_2)_+$. Then this
pair of basis functions is added to the set $\mathcal{M}$, and at the next
stage we consider including a pair of products of the form

$$
h_m(X) \cdot (X_j - t)_+ \quad \text{and} \quad h_m(X) \cdot (t - X_j)_+, \quad t \in \{x_{ij}\},
$$

where for $h_m$ we have the choices

$$
h_0(X) = 1, \qquad h_1(X) = (X_2 - x_{72})_+, \qquad \text{or} \qquad h_2(X) = (x_{72} - X_2)_+.
$$

The third choice produces functions such as
$(X_1 - x_{51})_+ \cdot (x_{72} - X_2)_+$, depicted in Figure 9.11.

At the end of this process we have a large model of the form (9.19). This
model typically overfits the data, and so a backward deletion procedure
is applied. The term whose removal causes the smallest increase in
residual squared error is deleted from the model at each stage, producing an
estimated best model $\hat f_\lambda$ of each size (number of terms)
$\lambda$. One could use cross-validation to estimate the optimal value of
$\lambda$, but for computational savings the MARS procedure instead uses
generalized cross-validation. This criterion is defined as

$$
\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat f_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}. \tag{9.20}
$$

The value $M(\lambda)$ is the effective number of parameters in the model:
this accounts both for the number of terms in the model and for the number
of parameters used in selecting the optimal positions of the knots. Some
mathematical and simulation results suggest that one should pay a price
of three parameters for selecting a knot in a piecewise linear regression.

Thus if there are $r$ linearly independent basis functions in the model, and
$K$ knots were selected in the forward process, the formula is
$M(\lambda) = r + cK$, where $c = 3$. (When the model is restricted to be
additive, as detailed below, a penalty of $c = 2$ is used.) Using this, we
choose the model along the backward sequence that minimizes
$\mathrm{GCV}(\lambda)$.
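
The criterion (9.20), with the effective number of parameters
$M(\lambda) = r + cK$, translates directly into code. The sketch below uses
our own function and argument names and hypothetical toy numbers. It simply
evaluates the formula for a given set of fitted values.

```python
def gcv(y, fitted, r, n_knots, c=3.0):
    """Generalized cross-validation criterion of (9.20).

    y, fitted : observed responses and fitted values, same length N
    r         : number of linearly independent basis functions
    n_knots   : number of knots K selected in the forward pass
    c         : charge per knot (3 in general, 2 for an additive model)
    """
    n = len(y)
    m_eff = r + c * n_knots                 # M(lambda) = r + c K
    if m_eff >= n:
        raise ValueError("effective number of parameters must be below N")
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / (1.0 - m_eff / n) ** 2


# Hypothetical fit: N = 100, every residual equal to 0.1, so RSS = 1.
y = [0.0] * 100
fitted = [0.1] * 100
print(gcv(y, fitted, r=5, n_knots=2))        # RSS / (1 - 11/100)^2
print(gcv(y, fitted, r=5, n_knots=2, c=2))   # additive-model charge
```

For a fixed residual sum of squares, each additional knot inflates the
criterion, so a larger model must reduce the training error enough to
justify its extra knots.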

Why these piecewise linear basis functions, and why this particular model 
strategy? A key property of the functions of Figure 9.9 is their ability to 
operate locally; they are zero over part of their range. When they are
multiplied together, as in Figure 9.11, the result is nonzero only over the small
part of the feature space where both component functions are nonzero. As 
a result, the regression surface is built up parsimoniously, using nonzero 
components locally—only where they are needed. This is important, since 
one should “spend” parameters carefully in high dimensions, as they can 
run out quickly. The use of other basis functions such as polynomials, would 
produce a nonzero product everywhere, and would not work as well. 

The second important advantage of the piecewise linear basis function
concerns computation. Consider the product of a function in $\mathcal{M}$
with each of the $N$ reflected pairs for an input $X_j$. This appears to
require the fitting of $N$ single-input linear regression models, each of
which uses $O(N)$ operations, making a total of $O(N^2)$ operations.
However, we can exploit the simple form of the piecewise linear function. We
first fit the reflected pair with rightmost knot. As the knot is moved
successively one position at a time to the left, the basis functions differ
by zero over the left part of the domain, and by a constant over the right
part. Hence after each such move we can update the fit in $O(1)$ operations.
This allows us to try every knot in only $O(N)$ operations.

The forward modeling strategy in MARS is hierarchical, in the sense that 
multiway products are built up from products involving terms already in 
the model. For example, a four-way product can only be added to the model 
if one of its three-way components is already in the model. The philosophy 
here is that a high-order interaction will likely only exist if some of its lower- 
order “footprints” exist as well. This need not be true, but is a reasonable 
working assumption and avoids the search over an exponentially growing 
space of alternatives. 

FIGURE 9.12. Spam data: test error misclassification rate for the MARS
procedure, as a function of the rank (number of independent basis functions)
in the model.


There is one restriction put on the formation of model terms: each input 
can appear at most once in a product. This prevents the formation of 
higher-order powers of an input, which increase or decrease too sharply 
near the boundaries of the feature space. Such powers can be approximated 
in a more stable way with piecewise linear functions. 

A useful option in the MARS procedure is to set an upper limit on 
the order of interaction. For example, one can set a limit of two, allowing 
pairwise products of piecewise linear functions, but not three- or
higher-way products. This can aid in the interpretation of the final model. An
upper limit of one results in an additive model. 


9.4.1 Spam Example (Continued)

We applied MARS to the “spam” data analyzed earlier in this chapter. To
enhance interpretability, we restricted MARS to second-degree interactions.
Although the target is a two-class variable, we used the squared-error loss
function nonetheless (see Section 9.4.3). Figure 9.12 shows the test error
misclassification rate as a function of the rank (number of independent
basis functions) in the model. The error rate levels off at about 5.5%,
which is slightly higher than that of the generalized additive model (5.3%)
discussed earlier. GCV chose a model size of 60, which is roughly the
smallest model giving optimal performance. The leading interactions found by
MARS involved inputs (ch$, remove), (ch$, free) and (hp, CAPTOT). However,
these interactions give no improvement in performance over the generalized
additive model.

9.4.2 Example (Simulated Data)

Here we examine the performance of MARS in three contrasting scenarios.
There are $N = 100$ observations, and the predictors
$X_1, X_2, \ldots, X_p$ and errors $\varepsilon$ have independent standard
normal distributions.

Scenario 1: The data generation model is

$$Y = (X_1 - 1)_+ + (X_1 - 1)_+ \cdot (X_2 - .8)_+ + 0.12 \cdot \varepsilon. \tag{9.21}$$

The noise standard deviation 0.12 was chosen so that the signal-to-noise
ratio was about 5. We call this the tensor-product scenario; the
product term gives a surface that looks like that of Figure 9.11.

Scenario 2: This is the same as scenario 1, but with $p = 20$ total
predictors; that is, there are 18 inputs that are independent of the
response.

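Scenario 3: This has the structure of a neural network:

$$
\begin{aligned}
\ell_1 &= X_1 + X_2 + X_3 + X_4 + X_5,\\
\ell_2 &= X_6 - X_7 + X_8 - X_9 + X_{10},\\
\sigma(t) &= 1/(1 + e^{-t}),\\
Y &= \sigma(\ell_1) + \sigma(\ell_2) + 0.12 \cdot \varepsilon.
\end{aligned} \tag{9.22}
$$
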

Scenarios 1 and 2 are ideally suited for MARS, while scenario 3 contains 
high-order interactions and may be difficult for MARS to approximate. We 
ran five simulations from each model, and recorded the results. 

In scenario 1, MARS typically uncovered the correct model almost
perfectly. In scenario 2, it found the correct structure but also found a few
extraneous terms involving other predictors. 

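Let $\mu(x)$ be the true mean of $Y$, and let

$$
\begin{aligned}
\mathrm{MSE}_0 &= \operatorname{ave}_{x \in \mathrm{Test}} (\bar{y} - \mu(x))^2,\\
\mathrm{MSE} &= \operatorname{ave}_{x \in \mathrm{Test}} (\hat{f}(x) - \mu(x))^2.
\end{aligned} \tag{9.23}
$$

These represent the mean-square error of the constant model and the fitted
MARS model, estimated by averaging at the 1000 test values of $x$. Table 9.4
shows the proportional decrease in model error or $R^2$ for each scenario:

$$R^2 = \frac{\mathrm{MSE}_0 - \mathrm{MSE}}{\mathrm{MSE}_0}. \tag{9.24}$$
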

The values shown are means and standard error over the five simulations. 
The performance of MARS is degraded only slightly by the inclusion of the 
useless inputs in scenario 2; it performs substantially worse in scenario 3. 

TABLE 9.4. Proportional decrease in model error ($R^2$) when MARS is applied
to three different scenarios.

Scenario                        Mean (S.E.)
1: Tensor product p = 2         0.97 (0.01)
2: Tensor product p = 20        0.96 (0.01)
3: Neural network               0.79 (0.01)


9.4.3 Other Issues

MARS for Classification 

The MARS method and algorithm can be extended to handle classification 
problems. Several strategies have been suggested. 

For two classes, one can code the output as 0/1 and treat the problem as
a regression; we did this for the spam example. For more than two classes,
one can use the indicator response approach described in Section 4.2. One
codes the $K$ response classes via 0/1 indicator variables, and then
performs a multi-response MARS regression. For the latter we use a common
set of basis functions for all response variables. Classification is made to
the class with the largest predicted response value. There are, however,
potential masking problems with this approach, as described in Section 4.2.
A generally superior approach is the “optimal scoring” method discussed
in Section 12.5.

Stone et al. (1997) developed a hybrid of MARS called PolyMARS
specifically designed to handle classification problems. It uses the
multiple logistic framework described in Section 4.4. It grows the model in
a forward stagewise fashion like MARS, but at each stage uses a quadratic
approximation to the multinomial log-likelihood to search for the next
basis-function pair. Once found, the enlarged model is fit by maximum
likelihood, and the process is repeated.

Relationship of MARS to CART 

Although they might seem quite different, the MARS and CART strategies 
actually have strong similarities. Suppose we take the MARS procedure and 
make the following changes: 

• Replace the piecewise linear basis functions by step functions
$I(x - t > 0)$ and $I(x - t \le 0)$.

• When a model term is involved in a multiplication by a candidate 
term, it gets replaced by the interaction, and hence is not available 
for further interactions. 

With these changes, the MARS forward procedure is the same as the CART 
tree-growing algorithm. Multiplying a step function by a pair of reflected 
step functions is equivalent to splitting a node at the step. The second
restriction implies that a node may not be split more than once, and leads 
to the attractive binary-tree representation of the CART model. On the 
other hand, it is this restriction that makes it difficult for CART to model 
additive structures. MARS forgoes the tree structure and gains the ability 
to capture additive effects. 


Mixed Inputs 

MARS can handle “mixed” predictors (quantitative and qualitative) in a
natural way, much like CART does. MARS considers all possible binary
partitions of the categories for a qualitative predictor into two groups. 
Each such partition generates a pair of piecewise constant basis functions— 
indicator functions for the two sets of categories. This basis pair is now 
treated as any other, and is used in forming tensor products with other 
basis functions already in the model. 


9.5 Hierarchical Mixtures of Experts 

The hierarchical mixtures of experts (HME) procedure can be viewed as a
variant of tree-based methods. The main difference is that the tree splits
are not hard decisions but rather soft probabilistic ones. At each node an
observation goes left or right with probabilities depending on its input
values. This has some computational advantages since the resulting parameter
optimization problem is smooth, unlike the discrete split point search in the
tree-based approach. The soft splits might also help in prediction accuracy
and provide a useful alternative description of the data.

There are other differences between HMEs and the CART implementation of
trees. In an HME, a linear (or logistic regression) model is fit in
each terminal node, instead of a constant as in CART. The splits can be
multiway, not just binary, and the splits are probabilistic functions of a
linear combination of inputs, rather than a single input as in the standard
use of CART. However, the relative merits of these choices are not clear,
and most were discussed at the end of Section 9.2.

A simple two-level HME model is shown in Figure 9.13. It can be thought
of as a tree with soft splits at each non-terminal node. However, the
inventors of this methodology use a different terminology. The terminal
nodes are called experts, and the non-terminal nodes are called gating
networks. The idea is that each expert provides an opinion (prediction)
about the response, and these are combined together by the gating networks.
As we will see, the model is formally a mixture model, and the two-level
model in the figure can be extended to multiple levels, hence the name
hierarchical mixtures of experts.

FIGURE 9.13. A two-level hierarchical mixture of experts (HME) model.


Consider the regression or classification problem, as described earlier in
the chapter. The data is $(x_i, y_i)$, $i = 1, 2, \ldots, N$, with $y_i$
either a continuous or binary-valued response, and $x_i$ a vector-valued
input. For ease of notation we assume that the first element of $x_i$ is
one, to account for intercepts.

Here is how an HME is defined. The top gating network has the output

$$
g_j(x, \gamma_j) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^{K} e^{\gamma_k^T x}}, \quad j = 1, 2, \ldots, K, \tag{9.25}
$$

where each $\gamma_j$ is a vector of unknown parameters. This represents a
soft $K$-way split ($K = 2$ in Figure 9.13). Each $g_j(x, \gamma_j)$ is the
probability of assigning an observation with feature vector $x$ to the
$j$th branch. Notice that with $K = 2$ groups, if we take the coefficient of
one of the elements of $x$ to be $+\infty$, then we get a logistic curve
with infinite slope. In this case, the gating probabilities are either 0 or
1, corresponding to a hard split on that input.
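
The gating output (9.25) is a softmax over the linear scores
$\gamma_j^T x$. The sketch below uses our own function name and hypothetical
parameter values. It evaluates the gating probabilities and shows, for
$K = 2$, how a very large coefficient on one input drives the soft split
toward a hard split at zero on that input.

```python
import math


def gating_probs(x, gammas):
    """Softmax gating probabilities g_j(x, gamma_j) of (9.25).

    x      : input vector whose first element is 1 (the intercept)
    gammas : list of K parameter vectors, one per branch
    """
    scores = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gammas]
    top = max(scores)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# K = 2 branches; branch 2 is the reference with all-zero parameters.
# A moderate slope gives a soft split, a huge slope an almost hard one.
soft = [[0.0, 2.0], [0.0, 0.0]]
hard = [[0.0, 1000.0], [0.0, 0.0]]

for x2 in (-0.3, 0.3):
    x = [1.0, x2]
    print(x2, gating_probs(x, soft)[0], gating_probs(x, hard)[0])
```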

At the second level, the gating networks have a similar form:

$$
g_{\ell|j}(x, \gamma_{j\ell}) = \frac{e^{\gamma_{j\ell}^T x}}{\sum_{k=1}^{K} e^{\gamma_{jk}^T x}}, \quad \ell = 1, 2, \ldots, K. \tag{9.26}
$$

This is the probability of assignment to the $\ell$th branch, given
assignment to the $j$th branch at the level above.

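At each expert (terminal node), we have a model for the response variable
of the form

$$Y \sim \Pr(y \mid x, \theta_{j\ell}). \tag{9.27}$$

This differs according to the problem.

Regression: The Gaussian linear regression model is used, with
$\theta_{j\ell} = (\beta_{j\ell}, \sigma_{j\ell}^2)$:

$$Y = \beta_{j\ell}^T x + \varepsilon \quad \text{and} \quad \varepsilon \sim N(0, \sigma_{j\ell}^2). \tag{9.28}$$

Classification: The linear logistic regression model is used:

$$\Pr(Y = 1 \mid x, \theta_{j\ell}) = \frac{1}{1 + e^{-\theta_{j\ell}^T x}}. \tag{9.29}$$
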

Denoting the collection of all parameters by T = the total 

probability that Y = y is 

K K 

PrG/l®,*) = ’529j(x,'Yj) 1 52ge\j{x,'yje)P*(y\x,0 j e)- (9-30) 

3=1 1=1 

This is a mixture model, with the mixture probabilities determined by the 
gating network models. 
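
To make the mixture structure of (9.30) concrete, here is a small sketch for the regression case with Gaussian experts. It is an illustration under stated assumptions rather than a reference implementation: the function names, the array layout of the parameters, and the numerical values are all hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()     # shift for numerical stability
    w = np.exp(shifted)
    return w / w.sum()

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def hme_density(y, x, top_gammas, sub_gammas, betas, variances):
    """Pr(y | x, Psi) from (9.30) for a two-level regression HME.

    top_gammas : (K, p)     gamma_j for the top gating network
    sub_gammas : (K, K, p)  gamma_{j,l} for the second-level gating networks
    betas      : (K, K, p)  regression coefficients of expert (j, l)
    variances  : (K, K)     noise variance of expert (j, l)
    """
    g_top = softmax(top_gammas @ x)
    total = 0.0
    for j in range(top_gammas.shape[0]):
        g_sub = softmax(sub_gammas[j] @ x)                    # g_{l|j}
        expert = gaussian_pdf(y, betas[j] @ x, variances[j])  # all l at once
        total += g_top[j] * np.sum(g_sub * expert)
    return total

# Illustrative parameter values (K = 2 branches per level, p = 2 inputs).
x = np.array([1.0, 0.5])
top_gammas = np.array([[0.5, -1.0], [0.0, 0.0]])
sub_gammas = np.zeros((2, 2, 2))
sub_gammas[0, 0] = [1.0, 2.0]
betas = np.array([[[0.0, 1.0], [1.0, -1.0]],
                  [[2.0, 0.0], [-1.0, 0.5]]])
variances = np.array([[1.0, 0.5], [0.25, 2.0]])

density = hme_density(0.3, x, top_gammas, sub_gammas, betas, variances)
```

Because the gating probabilities at each level sum to one, the result is a proper density in $y$ for every fixed $x$.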

To estimate the parameters, we maximize the log-likelihood of the data,
$\sum_i \log \Pr(y_i \mid x_i, \Psi)$, over the parameters in $\Psi$. The most convenient
method for doing this is the EM algorithm, which we describe for mixtures in
Section 8.5. We define latent variables $\Delta_j$, all of which are zero except for
a single one. We interpret these as the branching decisions made by the top
level gating network. Similarly we define latent variables $\Delta_{\ell|j}$ to describe
the gating decisions at the second level.

In the E-step, the EM algorithm computes the expectations of the $\Delta_j$
and $\Delta_{\ell|j}$ given the current values of the parameters. These expectations
are then used as observation weights in the M-step of the procedure, to
estimate the parameters in the expert networks. The parameters in the
internal nodes are estimated by a version of multiple logistic regression.
The expectations of the $\Delta_j$ or $\Delta_{\ell|j}$ are probability profiles, and these are
used as the response vectors for these logistic regressions.
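
For concreteness, the E-step expectations can be written out by applying Bayes' rule to the mixture (9.30). The symbols $h_j$ and $h_{\ell|j}$ below are notation introduced here for illustration and do not appear in the text; all parameters are held at their current values.

```latex
% Posterior probability of the second-level branch l, given top-level branch j:
h_{\ell|j}(x_i, y_i) = E(\Delta_{\ell|j} \mid \Delta_j = 1, x_i, y_i)
  = \frac{g_{\ell|j}(x_i, \gamma_{j\ell}) \Pr(y_i \mid x_i, \theta_{j\ell})}
         {\sum_{k=1}^K g_{k|j}(x_i, \gamma_{jk}) \Pr(y_i \mid x_i, \theta_{jk})}

% Posterior probability of the top-level branch j:
h_j(x_i, y_i) = E(\Delta_j \mid x_i, y_i)
  = \frac{g_j(x_i, \gamma_j) \sum_{\ell=1}^K g_{\ell|j}(x_i, \gamma_{j\ell}) \Pr(y_i \mid x_i, \theta_{j\ell})}
         {\sum_{k=1}^K g_k(x_i, \gamma_k) \sum_{\ell=1}^K g_{\ell|k}(x_i, \gamma_{k\ell}) \Pr(y_i \mid x_i, \theta_{k\ell})}
```

Under this reading, the product $h_j h_{\ell|j}$ is the weight that observation $i$ receives when fitting expert $(j, \ell)$ in the M-step.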

The hierarchical mixtures of experts approach is a promising competitor 
to CART trees. By using soft splits rather than hard decision rules it can 
capture situations where the transition from low to high response is gradual. 
The log-likelihood is a smooth function of the unknown weights and hence 
is amenable to numerical optimization. The model is similar to CART with 
linear combination splits, but the latter is more difficult to optimize. On
the other hand, to our knowledge there are no methods for finding a good
tree topology for the HME model, as there are in CART. Typically one uses
a fixed tree of some depth, possibly the output of the CART procedure.
The emphasis in the research on HMEs has been on prediction rather than
interpretation of the final model. A close cousin of the HME is the latent
class model (Lin et al., 2000), which typically has only one layer; here
the nodes or latent classes are interpreted as groups of subjects that show 
similar response behavior. 


9.6 Missing Data 

It is quite common to have observations with missing values for one or more 
input features. The usual approach is to impute (fill-in) the missing values 
in some way. 

However, the first issue in dealing with the problem is determining whether
the missing data mechanism has distorted the observed data. Roughly
speaking, data are missing at random if the mechanism resulting in their
omission is independent of their (unobserved) values. A more precise definition
is given in Little and Rubin (2002). Suppose $y$ is the response vector and $X$
is the $N \times p$ matrix of inputs (some of which are missing). Denote by
$X_{\text{obs}}$ the observed entries in $X$ and let $Z = (y, X)$, $Z_{\text{obs}} = (y, X_{\text{obs}})$.
Finally, if $R$ is an indicator matrix with $ij$th entry 1 if $x_{ij}$ is missing and
zero otherwise, then the data is said to be missing at random (MAR) if the
distribution of $R$ depends on the data $Z$ only through $Z_{\text{obs}}$:

$$ \Pr(R \mid Z, \theta) = \Pr(R \mid Z_{\text{obs}}, \theta). \tag{9.31} $$

Here $\theta$ are any parameters in the distribution of $R$. Data are said to be
missing completely at random (MCAR) if the distribution of $R$ doesn't
depend on the observed or missing data:

$$ \Pr(R \mid Z, \theta) = \Pr(R \mid \theta). \tag{9.32} $$

MCAR is a stronger assumption than MAR: most imputation methods rely 
on MCAR for their validity. 

For example, if a patient’s measurement was not taken because the doctor 
felt he was too sick, that observation would not be MAR or MCAR. In this 
case the missing data mechanism causes our observed training data to give a 
distorted picture of the true population, and data imputation is dangerous 
in this instance. Often the determination of whether features are MCAR 
must be made from information about the data collection process. For 
categorical features, one way to diagnose this problem is to code “missing” 
as an additional class. Then we fit our model to the training data and see 
if class “missing” is predictive of the response.

Assuming the features are missing completely at random, there are a
number of ways of proceeding: 

1. Discard observations with any missing values. 

2. Rely on the learning algorithm to deal with missing values in its 
training phase. 

3. Impute all missing values before training. 

Approach (1) can be used if the relative amount of missing data is small, 
but otherwise should be avoided. Regarding (2), CART is one learning 
algorithm that deals effectively with missing values, through surrogate splits 
(Section 9.2.4). MARS and PRIM use similar approaches. In generalized 
additive modeling, all observations missing for a given input feature are 
omitted when the partial residuals are smoothed against that feature in 
the backfitting algorithm, and their fitted values are set to zero. Since the 
fitted curves have mean zero (when the model includes an intercept), this 
amounts to assigning the average fitted value to the missing observations. 

For most learning methods, the imputation approach (3) is necessary. 
The simplest tactic is to impute the missing value with the mean or median 
of the nonmissing values for that feature. (Note that the above procedure 
for generalized additive models is analogous to this.) 
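
As an illustration of the simplest tactic, the sketch below fills each missing entry with the median of the observed values in the same column. It assumes missing values are coded as NaN in a numeric array; the function name `impute_median` and the toy matrix are hypothetical.

```python
import numpy as np

def impute_median(X):
    """Replace each NaN by the median of the observed values in its column.

    Returns the filled matrix and the boolean mask of imputed entries.
    """
    X = np.asarray(X, dtype=float)
    medians = np.nanmedian(X, axis=0)            # column medians, ignoring NaN
    missing = np.isnan(X)
    filled = X.copy()
    # np.nonzero(missing)[1] gives the column index of every missing entry.
    filled[missing] = np.take(medians, np.nonzero(missing)[1])
    return filled, missing

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 50.0]])
X_filled, mask = impute_median(X)
```

Keeping the mask alongside the filled matrix makes it possible to add "was missing" indicator features later, or to revisit the imputed entries with a more refined method.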

If the features have at least some moderate degree of dependence, one 
can do better by estimating a predictive model for each feature given the 
other features and then imputing each missing value by its prediction from 
the model. In choosing the learning method for imputation of the features, 
one must remember that this choice is distinct from the method used for 
predicting y from X. Thus a flexible, adaptive method will often be preferred,
even for the eventual purpose of carrying out a linear regression of y
on X. In addition, if there are many missing feature values in the training 
set, the learning method must itself be able to deal with missing feature 
values. CART therefore is an ideal choice for this imputation “engine.” 

After imputation, missing values are typically treated as if they were actually
observed. This ignores the uncertainty due to the imputation, which
will itself introduce additional uncertainty into estimates and predictions 
from the response model. One can measure this additional uncertainty by 
doing multiple imputations and hence creating many different training sets. 
The predictive model for y can be fit to each training set, and the variation 
across training sets can be assessed. If CART was used for the imputation 
engine, the multiple imputations could be done by sampling from the values 
in the corresponding terminal nodes.


9.7 Computational Considerations

With N observations and p predictors, additive model fitting requires some 
number mp of applications of a one-dimensional smoother or regression 
method. The required number of cycles m of the backfitting algorithm is 
usually less than 20 and often less than 10, and depends on the amount 
of correlation in the inputs. With cubic smoothing splines, for example, 
N log N operations are needed for an initial sort and N operations for the 
spline fit. Hence the total operations for an additive model fit is pN log N + 
mpN. 

Trees require $pN \log N$ operations for an initial sort for each predictor,
and typically another $pN \log N$ operations for the split computations. If the
splits occurred near the edges of the predictor ranges, this number could
increase to $N^2 p$.

MARS requires $Nm^2 + pmN$ operations to add a basis function to a
model with $m$ terms already present, from a pool of $p$ predictors. Hence to
build an $M$-term model requires $NM^3 + pM^2N$ computations, which can
be quite prohibitive if $M$ is a reasonable fraction of $N$.

Each of the components of an HME is typically inexpensive to fit at
each M-step: $Np^2$ for the regressions, and $Np^2K^2$ for a $K$-class logistic
regression. The EM algorithm, however, can take a long time to converge,
and so sizable HME models are considered costly to fit.


Bibliographic Notes 

The most comprehensive source for generalized additive models is the text 
of that name by Hastie and Tibshirani (1990). Different applications of 
this work in medical problems are discussed in Hastie et al. (1989) and 
Hastie and Herman (1990), and the software implementation in Splus is 
described in Chambers and Hastie (1991). Green and Silverman (1994) 
discuss penalization and spline models in a variety of settings. Efron and 
Tibshirani (1991) give an exposition of modern developments in statistics 
(including generalized additive models), for a nonmathematical audience. 
Classification and regression trees date back at least as far as Morgan and 
Sonquist (1963). We have followed the modern approaches of Breiman et 
al. (1984) and Quinlan (1993). The PRIM method is due to Friedman 
and Fisher (1999), while MARS is introduced in Friedman (1991), with an 
additive precursor in Friedman and Silverman (1989). Hierarchical mixtures 
of experts were proposed in Jordan and Jacobs (1994); see also Jacobs et 
al. (1991).


Exercises

Ex. 9.1 Show that a smoothing spline fit of $y_i$ to $x_i$ preserves the linear
part of the fit. In other words, if $y_i = \hat{y}_i + r_i$, where $\hat{y}_i$ represents the
linear regression fits, and $S$ is the smoothing matrix, then $Sy = \hat{y} + Sr$.
Show that the same is true for local linear regression (Section 6.1.1). Hence
argue that the adjustment step in the second line of (2) in Algorithm 9.1
is unnecessary.

Ex. 9.2 Let $A$ be a known $k \times k$ matrix, $b$ be a known $k$-vector, and $z$
be an unknown $k$-vector. A Gauss-Seidel algorithm for solving the linear
system of equations $Az = b$ works by successively solving for element $z_j$ in
the $j$th equation, fixing all other $z_j$'s at their current guesses. This process
is repeated for $j = 1, 2, \ldots, k, 1, 2, \ldots, k, \ldots$, until convergence (Golub and
Van Loan, 1983).

(a) Consider an additive model with $N$ observations and $p$ terms, with
the $j$th term to be fit by a linear smoother $S_j$. Consider the following
system of equations:

$$ \begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix}. $$

Here each $f_j$ is an $N$-vector of evaluations of the $j$th function at
the data points, and $y$ is an $N$-vector of the response values. Show
that backfitting is a blockwise Gauss-Seidel algorithm for solving this
system of equations.

(b) Let $S_1$ and $S_2$ be symmetric smoothing operators (matrices) with
eigenvalues in $[0, 1)$. Consider a backfitting algorithm with response
vector $y$ and smoothers $S_1, S_2$. Show that with any starting values,
the algorithm converges and give a formula for the final iterates.

Ex. 9.3 Backfitting equations. Consider a backfitting procedure with orthogonal
projections, and let $D$ be the overall regression matrix whose columns
span $V = \mathcal{L}_{col}(S_1) \oplus \mathcal{L}_{col}(S_2) \oplus \cdots \oplus \mathcal{L}_{col}(S_p)$, where $\mathcal{L}_{col}(S)$ denotes the
column space of a matrix $S$. Show that the estimating equations

$$ \begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix} $$

are equivalent to the least squares normal equations $D^T D \beta = D^T y$ where
$\beta$ is the vector of coefficients.

Ex. 9.4 Suppose the same smoother $S$ is used to estimate both terms in a
two-term additive model (i.e., both variables are identical). Assume that $S$
is symmetric with eigenvalues in $[0, 1)$. Show that the backfitting residual
converges to $(I + S)^{-1}(I - S)y$, and that the residual sum of squares converges
upward. Can the residual sum of squares converge upward in less
structured situations? How does this fit compare to the fit with a single
term fit by $S$? [Hint: Use the eigen-decomposition of $S$ to help with this
comparison.]

Ex. 9.5 Degrees of freedom of a tree. Given data $y_i$ with mean $f(x_i)$ and
variance $\sigma^2$, and a fitting operation $y \to \hat{y}$, let's define the degrees of
freedom of a fit by $\sum_i \mathrm{cov}(y_i, \hat{y}_i)/\sigma^2$.

Consider a fit $\hat{y}$ estimated by a regression tree, fit to a set of predictors
$X_1, X_2, \ldots, X_p$.

(a) In terms of the number of terminal nodes $m$, give a rough formula for
the degrees of freedom of the fit.

(b) Generate 100 observations with predictors $X_1, X_2, \ldots, X_{10}$ as independent
standard Gaussian variates and fix these values.

(c) Generate response values also as standard Gaussian ($\sigma^2 = 1$), independent
of the predictors. Fit regression trees to the data of fixed size 1, 5
and 10 terminal nodes and hence estimate the degrees of freedom of
each fit. [Do ten simulations of the response and average the results,
to get a good estimate of degrees of freedom.]

(d) Compare your estimates of degrees of freedom in (a) and (c) and
discuss.

(e) If the regression tree fit were a linear operation, we could write $\hat{y} = Sy$
for some matrix $S$. Then the degrees of freedom would be $\mathrm{tr}(S)$.
Suggest a way to compute an approximate $S$ matrix for a regression
tree, compute it and compare the resulting degrees of freedom to
those in (a) and (c).

Ex. 9.6 Consider the ozone data of Figure 6.9. 

(a) Fit an additive model to the cube root of ozone concentration, as a
function of temperature, wind speed, and radiation. Compare your
results to those obtained via the trellis display in Figure 6.9.

(b) Fit trees, MARS, and PRIM to the same data, and compare the results
to those found in (a) and in Figure 6.9.


10.1 Boosting Methods

Boosting is one of the most powerful learning ideas introduced in the last 
twenty years. It was originally designed for classification problems, but as 
will be seen in this chapter, it can profitably be extended to regression 
as well. The motivation for boosting was a procedure that combines the 
outputs of many “weak” classifiers to produce a powerful “committee.” 
From this perspective boosting bears a resemblance to bagging and other 
committee-based approaches (Section 8.8). However we shall see that the 
connection is at best superficial and that boosting is fundamentally different.

We begin by describing the most popular boosting algorithm due to
Freund and Schapire (1997) called “AdaBoost.M1.” Consider a two-class
problem, with the output variable coded as $Y \in \{-1, 1\}$. Given a vector of
predictor variables $X$, a classifier $G(X)$ produces a prediction taking one
of the two values $\{-1, 1\}$. The error rate on the training sample is

$$ \overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i)), $$

and the expected error rate on future predictions is $E_{XY} I(Y \neq G(X))$.

A weak classifier is one whose error rate is only slightly better than 
random guessing. The purpose of boosting is to sequentially apply the 
weak classification algorithm to repeatedly modified versions of the data, 
thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$.

FIGURE 10.1. Schematic of AdaBoost. Classifiers are trained on weighted versions
of the dataset, and then combined to produce a final prediction.

The predictions from all of them are then combined through a weighted
majority vote to produce the final prediction:

$$ G(x) = \mathrm{sign}\left( \sum_{m=1}^M \alpha_m G_m(x) \right). \tag{10.1} $$


Here $\alpha_1, \alpha_2, \ldots, \alpha_M$ are computed by the boosting algorithm, and weight
the contribution of each respective $G_m(x)$. Their effect is to give higher
influence to the more accurate classifiers in the sequence. Figure 10.1 shows
a schematic of the AdaBoost procedure.

The data modifications at each boosting step consist of applying weights
$w_1, w_2, \ldots, w_N$ to each of the training observations $(x_i, y_i)$, $i = 1, 2, \ldots, N$.
Initially all of the weights are set to $w_i = 1/N$, so that the first step simply
trains the classifier on the data in the usual manner. For each successive
iteration $m = 2, 3, \ldots, M$ the observation weights are individually modified
and the classification algorithm is reapplied to the weighted observations.
At step $m$, those observations that were misclassified by the classifier
$G_{m-1}(x)$ induced at the previous step have their weights increased, whereas
the weights are decreased for those that were classified correctly. Thus as
iterations proceed, observations that are difficult to classify correctly receive
ever-increasing influence. Each successive classifier is thereby forced
to concentrate on those training observations that are missed by previous
ones in the sequence.


Algorithm 10.1 AdaBoost.M1.

1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. For $m = 1$ to $M$:

   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.

   (b) Compute
       $$ \mathrm{err}_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}. $$

   (c) Compute $\alpha_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$.

   (d) Set $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$.

3. Output $G(x) = \mathrm{sign}\left[ \sum_{m=1}^M \alpha_m G_m(x) \right]$.


Algorithm 10.1 shows the details of the AdaBoost.M1 algorithm. The
current classifier $G_m(x)$ is induced on the weighted observations at line 2a.
The resulting weighted error rate is computed at line 2b. Line 2c calculates
the weight $\alpha_m$ given to $G_m(x)$ in producing the final classifier $G(x)$ (line
3). The individual weights of each of the observations are updated for the
next iteration at line 2d. Observations misclassified by $G_m(x)$ have their
weights scaled by a factor $\exp(\alpha_m)$, increasing their relative influence for
inducing the next classifier $G_{m+1}(x)$ in the sequence.
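
The steps of Algorithm 10.1 can be sketched in a few lines of NumPy with an exhaustive decision stump as the weak classifier. This is an illustrative sketch, not the implementation used for the figures in this chapter: the function names are hypothetical, and two details are additions that are not part of Algorithm 10.1, namely clipping $\mathrm{err}_m$ away from 0 and 1 to avoid taking the log of zero, and renormalizing the weights after each update (which leaves $\mathrm{err}_m$ and $\alpha_m$ unchanged).

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: returns (feature, threshold, sign) with the
    smallest weighted error, predicting sign * (+1 if x_j > t else -1)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = np.where(X[:, j] > t, 1, -1)
            err = np.sum(w * (pred != y))
            # The flipped stump misclassifies exactly the complementary points.
            for sign, e in ((1, err), (-1, np.sum(w) - err)):
                if e < best[0]:
                    best = (e, j, t, sign)
    return best[1:]

def stump_predict(stump, X):
    j, t, sign = stump
    return sign * np.where(X[:, j] > t, 1, -1)

def adaboost_m1(X, y, M):
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1
    stumps, alphas = [], []
    for _ in range(M):                            # step 2
        stump = fit_stump(X, y, w)                # 2(a)
        miss = stump_predict(stump, X) != y
        err = np.sum(w * miss) / np.sum(w)        # 2(b)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # added guard, not in Alg. 10.1
        alpha = np.log((1 - err) / err)           # 2(c)
        w = w * np.exp(alpha * miss)              # 2(d)
        w = w / w.sum()                           # added rescaling, harmless
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                         # step 3 (0 on an exact tie)
```

A call such as `stumps, alphas = adaboost_m1(X, y, 100)` followed by `adaboost_predict(stumps, alphas, X_new)` reproduces the weighted majority vote of (10.1); labels are assumed to be coded as $-1/+1$.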

The AdaBoost.M1 algorithm is known as “Discrete AdaBoost” in Friedman
et al. (2000), because the base classifier $G_m(x)$ returns a discrete class
label. If the base classifier instead returns a real-valued prediction (e.g.,
a probability mapped to the interval $[-1, 1]$), AdaBoost can be modified
appropriately (see “Real AdaBoost” in Friedman et al. (2000)).

The power of AdaBoost to dramatically increase the performance of even
a very weak classifier is illustrated in Figure 10.2. The features $X_1, \ldots, X_{10}$
are standard independent Gaussian, and the deterministic target $Y$ is defined by

$$ Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5), \\ -1 & \text{otherwise.} \end{cases} \tag{10.2} $$

Here $\chi^2_{10}(0.5) = 9.34$ is the median of a chi-squared random variable with
10 degrees of freedom (sum of squares of 10 standard Gaussians). There are
2000 training cases, with approximately 1000 cases in each class, and 10,000
test observations. Here the weak classifier is just a “stump”: a two
terminal-node classification tree. Applying this classifier alone to the training data
set yields a very poor test set error rate of 45.8%, compared to 50% for
random guessing. However, as boosting iterations proceed the error rate
steadily decreases, reaching 5.8% after 400 iterations. Thus, boosting this
simple very weak classifier reduces its prediction error rate by nearly a
factor of eight (from 45.8% to 5.8%). It also outperforms a single large
classification tree (error rate 24.7%). Since its introduction, much has been
written to explain the success of AdaBoost in producing accurate classifiers.
Most of this work has centered on using classification trees as the “base
learner” $G(x)$, where improvements are often most dramatic. In fact, Breiman
(NIPS Workshop, 1996) referred to AdaBoost with trees as the “best
off-the-shelf classifier in the world” (see also Breiman (1998)). This is
especially the case for data-mining applications, as discussed more fully in
Section 10.7 later in this chapter.

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps,
as a function of the number of iterations. Also shown are the test error rate for
a single stump, and a 244-node classification tree.
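
The simulated problem (10.2) is easy to regenerate. The sketch below draws training and test sets of the sizes quoted above; the function name, the random seed, and the use of the rounded median 9.34 are illustrative choices, so the draws will not match the data behind Figure 10.2.

```python
import numpy as np

CHI2_10_MEDIAN = 9.34   # median of a chi-squared(10) variable, as quoted in the text

def simulate_data_10_2(n, rng):
    """Draw n cases: ten standard Gaussian features and the target of (10.2)."""
    X = rng.standard_normal((n, 10))
    y = np.where((X ** 2).sum(axis=1) > CHI2_10_MEDIAN, 1, -1)
    return X, y

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
X_train, y_train = simulate_data_10_2(2000, rng)
X_test, y_test = simulate_data_10_2(10000, rng)
```

Because the threshold is the median of the sum of squares, the two classes are balanced on average, which is why random guessing has a 50% error rate here.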


10.1.1 Outline of This Chapter 

Here is an outline of the developments in this chapter: 

• We show that AdaBoost fits an additive model in a base learner,
optimizing a novel exponential loss function. This loss function is
very similar to the (negative) binomial log-likelihood (Sections 10.2-10.4).

• The population minimizer of the exponential loss function is shown 
to be the log-odds of the class probabilities (Section 10.5). 

• We describe loss functions for regression and classification that are 
more robust than squared error or exponential loss (Section 10.6). 

• It is argued that decision trees are an ideal base learner for data 
mining applications of boosting (Sections 10.7 and 10.9). 

• We develop a class of gradient boosted models (GBMs), for boosting 
trees with any loss function (Section 10.10). 

• The importance of “slow learning” is emphasized, and implemented 
by shrinkage of each new term that enters the model (Section 10.12), 
as well as randomization (Section 10.12.2). 

• Tools for interpretation of the fitted model are described (Section 10.13). 

10.2 Boosting Fits an Additive Model 

The success of boosting is really not very mysterious. The key lies in
expression (10.1). Boosting is a way of fitting an additive expansion in a set
of elementary “basis” functions. Here the basis functions are the individual
classifiers $G_m(x) \in \{-1, 1\}$. More generally, basis function expansions take
the form

$$ f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m), \tag{10.3} $$

where $\beta_m$, $m = 1, 2, \ldots, M$ are the expansion coefficients, and $b(x; \gamma) \in \mathbb{R}$
are usually simple functions of the multivariate argument $x$, characterized
by a set of parameters $\gamma$. We discuss basis expansions in some detail in
Chapter 5.

Additive expansions like this are at the heart of many of the learning 
techniques covered in this book: 

• In single-hidden-layer neural networks (Chapter 11), $b(x; \gamma) = \sigma(\gamma_0 + \gamma_1^T x)$,
where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function, and $\gamma$ parameterizes
a linear combination of the input variables.

• In signal processing, wavelets (Section 5.9.1) are a popular choice with
$\gamma$ parameterizing the location and scale shifts of a “mother” wavelet.

• Multivariate adaptive regression splines (Section 9.4) uses truncated-power
spline basis functions where $\gamma$ parameterizes the variables and
values for the knots.

• For trees, $\gamma$ parameterizes the split variables and split points at the
internal nodes, and the predictions at the terminal nodes.


Algorithm 10.2 Forward Stagewise Additive Modeling.

1. Initialize $f_0(x) = 0$.

2. For $m = 1$ to $M$:

   (a) Compute
       $$ (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)). $$

   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.


Typically these models are fit by minimizing a loss function averaged
over the training data, such as the squared-error or a likelihood-based loss
function,

$$ \min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\left( y_i, \sum_{m=1}^M \beta_m b(x_i; \gamma_m) \right). \tag{10.4} $$

For many loss functions $L(y, f(x))$ and/or basis functions $b(x; \gamma)$, this requires
computationally intensive numerical optimization techniques. However,
a simple alternative often can be found when it is feasible to rapidly
solve the subproblem of fitting just a single basis function,

$$ \min_{\beta, \gamma} \sum_{i=1}^N L(y_i, \beta b(x_i; \gamma)). \tag{10.5} $$


10.3 Forward Stagewise Additive Modeling 

Forward stagewise modeling approximates the solution to (10.4) by sequentially
adding new basis functions to the expansion without adjusting the
parameters and coefficients of those that have already been added. This is
outlined in Algorithm 10.2. At each iteration $m$, one solves for the optimal
basis function $b(x; \gamma_m)$ and corresponding coefficient $\beta_m$ to add to the current
expansion $f_{m-1}(x)$. This produces $f_m(x)$, and the process is repeated.
Previously added terms are not modified.

For squared-error loss

$$ L(y, f(x)) = (y - f(x))^2, \tag{10.6} $$

one has

$$ L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2 = (r_{im} - \beta b(x_i; \gamma))^2, \tag{10.7} $$

where $r_{im} = y_i - f_{m-1}(x_i)$ is simply the residual of the current model
on the $i$th observation. Thus, for squared-error loss, the term $\beta_m b(x; \gamma_m)$
that best fits the current residuals is added to the expansion at each step.
This idea is the basis for “least squares” regression boosting discussed in 
Section 10.10.2. However, as we show near the end of the next section, 
squared-error loss is generally not a good choice for classification; hence 
the need to consider other loss criteria. 
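
For squared-error loss, forward stagewise fitting therefore amounts to repeatedly fitting the current residuals. The following sketch does this for a single input with regression stumps as basis functions. It is a simplified illustration under stated assumptions: the function names are hypothetical, the coefficient $\beta_m$ is absorbed into the two leaf constants of each stump, and no shrinkage is applied.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Least squares stump for residuals r on one input x.

    Returns (threshold, left_value, right_value); threshold is None when x
    has a single distinct value and only a constant can be fit.
    """
    best = (np.inf, None, r.mean(), r.mean())
    for t in np.unique(x)[:-1]:                  # split x <= t versus x > t
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def stump_values(stump, x):
    t, c_left, c_right = stump
    if t is None:
        return np.full(x.shape, c_left)
    return np.where(x <= t, c_left, c_right)

def forward_stagewise(x, y, M):
    """Algorithm 10.2 with squared-error loss and stump basis functions."""
    f = np.zeros_like(y, dtype=float)            # f_0(x) = 0
    stumps, losses = [], []
    for _ in range(M):
        residual = y - f                         # r_im in (10.7)
        stump = fit_regression_stump(x, residual)
        f = f + stump_values(stump, x)           # earlier terms are not revisited
        stumps.append(stump)
        losses.append(np.mean((y - f) ** 2))     # training mean squared error
    return stumps, losses
```

Since each stump is the least squares fit to the current residuals, the training error recorded in `losses` cannot increase from one step to the next.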


10.4 Exponential Loss and AdaBoost 

We now show that AdaBoost.M1 (Algorithm 10.1) is equivalent to forward
stagewise additive modeling (Algorithm 10.2) using the loss function

$$ L(y, f(x)) = \exp(-y f(x)). \tag{10.8} $$

The appropriateness of this criterion is addressed in the next section.

For AdaBoost the basis functions are the individual classifiers $G_m(x) \in \{-1, 1\}$.
Using the exponential loss function, one must solve

$$ (\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N \exp[-y_i (f_{m-1}(x_i) + \beta G(x_i))] $$

for the classifier $G_m$ and corresponding coefficient $\beta_m$ to be added at each
step. This can be expressed as

$$ (\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp(-\beta y_i G(x_i)) \tag{10.9} $$

with $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$. Since each $w_i^{(m)}$ depends neither on $\beta$
nor $G(x)$, it can be regarded as a weight that is applied to each observation.
This weight depends on $f_{m-1}(x_i)$, and so the individual weight values
change with each iteration $m$.

The solution to (10.9) can be obtained in two steps. First, for any value
of $\beta > 0$, the solution to (10.9) for $G_m(x)$ is

$$ G_m = \arg\min_G \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)), \tag{10.10} $$

which is the classifier that minimizes the weighted error rate in predicting
$y$. This can be easily seen by expressing the criterion in (10.9) as

$$ e^{-\beta} \cdot \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \cdot \sum_{y_i \neq G(x_i)} w_i^{(m)}, $$

which in turn can be written as

$$ (e^{\beta} - e^{-\beta}) \cdot \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \cdot \sum_{i=1}^N w_i^{(m)}. \tag{10.11} $$

Plugging this $G_m$ into (10.9) and solving for $\beta$ one obtains

$$ \beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \tag{10.12} $$

where $\mathrm{err}_m$ is the minimized weighted error rate

$$ \mathrm{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}. \tag{10.13} $$
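
The step from (10.11) to (10.12) is a one-line calculus exercise, spelled out here for completeness. With $G = G_m$ fixed, differentiate (10.11) with respect to $\beta$, set the derivative to zero, and divide through by $\sum_i w_i^{(m)}$:

```latex
\frac{\partial}{\partial \beta}:\quad
(e^{\beta} + e^{-\beta}) \sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))
  - e^{-\beta} \sum_{i=1}^N w_i^{(m)} = 0
\;\Longrightarrow\;
(e^{\beta} + e^{-\beta})\, \mathrm{err}_m = e^{-\beta}
\;\Longrightarrow\;
e^{2\beta} = \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}
\;\Longrightarrow\;
\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}.
```

The criterion is convex in $\beta$, so this stationary point is the minimizer; it is positive whenever $\mathrm{err}_m < 1/2$, consistent with the assumption $\beta > 0$ used to obtain (10.10).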


The approximation is then updated

$$ f_m(x) = f_{m-1}(x) + \beta_m G_m(x), $$

which causes the weights for the next iteration to be

$$ w_i^{(m+1)} = w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)}. \tag{10.14} $$

Using the fact that $-y_i G_m(x_i) = 2 \cdot I(y_i \neq G_m(x_i)) - 1$, (10.14) becomes

$$ w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \neq G_m(x_i))} \cdot e^{-\beta_m}, \tag{10.15} $$

where $\alpha_m = 2\beta_m$ is the quantity defined at line 2c of AdaBoost.M1
(Algorithm 10.1). The factor $e^{-\beta_m}$ in (10.15) multiplies all weights by the
same value, so it has no effect. Thus (10.15) is equivalent to line 2(d) of
Algorithm 10.1.
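
The equivalence of the two weight updates can be verified numerically on a toy example. The labels, predictions, and weights below are made up for illustration; only the formulas (10.12) to (10.15) come from the text.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1, -1])              # true labels
G = np.array([1, -1, -1, 1, 1, -1])              # predictions of G_m
w = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])     # current weights w_i^(m)

miss = (y != G)
err = np.sum(w * miss) / np.sum(w)               # (10.13)
beta = 0.5 * np.log((1 - err) / err)             # (10.12)
alpha = 2 * beta                                 # line 2(c) of Algorithm 10.1

w_stagewise = w * np.exp(-beta * y * G)          # (10.14)
w_adaboost = w * np.exp(alpha * miss)            # line 2(d) of Algorithm 10.1

# The two updates differ only by the constant factor exp(-beta), as in (10.15),
# so they agree once the weights are normalized to sum to one.
same = np.allclose(w_stagewise / w_stagewise.sum(),
                   w_adaboost / w_adaboost.sum())
```

A useful side observation: under the updated weights the classifier $G_m$ just fitted has weighted error exactly 1/2, so the next classifier cannot simply repeat it.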

One can view line 2(a) of the AdaBoost.M1 algorithm as a method for
approximately solving the minimization in (10.11) and hence (10.10). Hence
we conclude that AdaBoost.M1 minimizes the exponential loss criterion
(10.8) via a forward-stagewise additive modeling approach.

Figure 10.3 shows the training-set misclassification error rate and average
exponential loss for the simulated data problem (10.2) of Figure 10.2.
The training-set misclassification error decreases to zero at around 250
iterations (and remains there), but the exponential loss keeps decreasing.
Notice also in Figure 10.2 that the test-set misclassification error continues
to improve after iteration 250. Clearly AdaBoost is not optimizing
training-set misclassification error; the exponential loss is more sensitive to
changes in the estimated class probabilities.

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error
rate on the training set, and average exponential loss: $(1/N) \sum_{i=1}^N \exp(-y_i f(x_i))$.
After about 250 iterations, the misclassification error is zero, while the exponential
loss continues to decrease.


10.5 Why Exponential Loss? 

The AdaBoost.M1 algorithm was originally motivated from a very different
perspective than presented in the previous section. Its equivalence to
forward stagewise additive modeling based on exponential loss was only
discovered five years after its inception. By studying the properties of the
exponential loss criterion, one can gain insight into the procedure and discover
ways it might be improved.

The principal attraction of exponential loss in the context of additive
modeling is computational; it leads to the simple modular reweighting
AdaBoost algorithm. However, it is of interest to inquire about its statistical
properties. What does it estimate and how well is it being estimated? The
first question is answered by seeking its population minimizer.

It is easy to show (Friedman et al., 2000) that

$$ f^*(x) = \arg\min_{f(x)} E_{Y|x}\left( e^{-Y f(x)} \right) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}, \tag{10.16} $$

or equivalently

$$ \Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f^*(x)}}. $$

Thus, the additive expansion produced by AdaBoost is estimating one-half
the log-odds of $\Pr(Y = 1 \mid x)$. This justifies using its sign as the classification
rule in (10.1).
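
The calculation behind (10.16) is short. Writing $p = \Pr(Y = 1 \mid x)$ as shorthand (notation used only in this aside), the conditional expectation is a function of the scalar $f = f(x)$ that can be minimized pointwise:

```latex
E_{Y|x}\left( e^{-Y f} \right) = p\, e^{-f} + (1 - p)\, e^{f},
\qquad
\frac{\partial}{\partial f}:\; -p\, e^{-f} + (1 - p)\, e^{f} = 0
\;\Longrightarrow\;
e^{2f} = \frac{p}{1 - p}
\;\Longrightarrow\;
f^*(x) = \frac{1}{2} \log \frac{p}{1 - p}.
```

The second derivative $p\, e^{-f} + (1 - p)\, e^{f}$ is positive, so the stationary point is a minimum. Solving $e^{2f^*} = p/(1-p)$ for $p$ gives the equivalent form $p = 1/(1 + e^{-2f^*})$.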

Another loss criterion with the same population minimizer is the binomial
negative log-likelihood or deviance (also known as cross-entropy),
interpreting $f$ as the logit transform. Let

$$ p(x) = \Pr(Y = 1 \mid x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2 f(x)}} \tag{10.17} $$

and define $Y' = (Y + 1)/2 \in \{0, 1\}$. Then the binomial log-likelihood loss
function is

$$ l(Y, p(x)) = Y' \log p(x) + (1 - Y') \log(1 - p(x)), $$

or equivalently the deviance is

$$ -l(Y, f(x)) = \log\left( 1 + e^{-2 Y f(x)} \right). \tag{10.18} $$

Since the population maximizer of log-likelihood is at the true probabilities
$p(x) = \Pr(Y = 1 \mid x)$, we see from (10.17) that the population minimizers of
the deviance $E_{Y|x}[-l(Y, f(x))]$ and $E_{Y|x}[e^{-Y f(x)}]$ are the same. Thus, using
either criterion leads to the same solution at the population level. Note that
$e^{-Yf}$ itself is not a proper log-likelihood, since it is not the logarithm of
any probability mass function for a binary random variable $Y \in \{-1, 1\}$.


10.6 Loss Functions and Robustness 

In this section we examine the different loss functions for classification and 
regression more closely, and characterize them in terms of their robustness 
to extreme data. 

Robust Loss Functions for Classification 

Although both the exponential (10.8) and binomial deviance (10.18) yield
the same solution when applied to the population joint distribution, the
same is not true for finite data sets. Both criteria are monotone decreasing
functions of the “margin” $y f(x)$. In classification (with a $-1/1$ response)
the margin plays a role analogous to the residuals $y - f(x)$ in regression. The
classification rule $G(x) = \mathrm{sign}[f(x)]$ implies that observations with positive
margin $y_i f(x_i) > 0$ are classified correctly whereas those with negative
margin $y_i f(x_i) < 0$ are misclassified. The decision boundary is defined by
$f(x) = 0$. The goal of the classification algorithm is to produce positive
margins as frequently as possible. Any loss criterion used for classification
should penalize negative margins more heavily than positive ones since
positive margin observations are already correctly classified.

FIGURE 10.4. Loss functions for two-class classification. The response is
$y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are
misclassification: $I(\mathrm{sign}(f) \neq y)$; exponential: $\exp(-yf)$; binomial deviance:
$\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see
Section 12.3). Each function has been scaled so that it passes through the point
$(0, 1)$.

Figure 10.4 shows both the exponential (10.8) and binomial deviance
criteria as a function of the margin $y \cdot f(x)$. Also shown is misclassification
loss $L(y, f(x)) = I(y \cdot f(x) < 0)$, which gives unit penalty for negative margin
values, and no penalty at all for positive ones. Both the exponential
and deviance loss can be viewed as monotone continuous approximations
to misclassification loss. They continuously penalize increasingly negative
margin values more heavily than they reward increasingly positive ones.
The difference between them is in degree. The penalty associated with binomial
deviance increases linearly for large increasingly negative margin,
whereas the exponential criterion increases the influence of such observations
exponentially.

At any point in the training process the exponential criterion concentrates
much more influence on observations with large negative margins.
Binomial deviance concentrates relatively less influence on such observations,
more evenly spreading the influence among all of the data. It is
therefore far more robust in noisy settings where the Bayes error rate is
not close to zero, and especially in situations where there is misspecification
of the class labels in the training data. The performance of AdaBoost has
been empirically observed to dramatically degrade in such situations.

Also shown in the figure is squared-error loss. The minimizer of the corresponding
risk on the population is

$$f^*(x) = \arg\min_{f(x)} E_{Y|x}(Y - f(x))^2 = E(Y \mid x) = 2 \cdot \Pr(Y = 1 \mid x) - 1. \tag{10.19}$$

As before the classification rule is $G(x) = \mathrm{sign}[f(x)]$. Squared-error loss
is not a good surrogate for misclassification error. As seen in Figure 10.4, it
is not a monotone decreasing function of increasing margin $yf(x)$. For margin
values $y_i f(x_i) > 1$ it increases quadratically, thereby placing increasing
influence (error) on observations that are correctly classified with increasing
certainty, thereby reducing the relative influence of those incorrectly
classified $y_i f(x_i) < 0$. Thus, if class assignment is the goal, a monotone decreasing
criterion serves as a better surrogate loss function. Figure 12.4 on
page 426 in Chapter 12 includes a modification of quadratic loss, the “Huberized”
square hinge loss (Rosset et al., 2004b), which enjoys the favorable
properties of the binomial deviance, quadratic loss and the SVM hinge loss.
It has the same population minimizer as the quadratic (10.19), is zero for
$y \cdot f(x) > 1$, and becomes linear for $y \cdot f(x) < -1$. Since quadratic functions
are easier to compute with than exponentials, our experience suggests this
to be a useful alternative to the binomial deviance.
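The behaviour of these criteria as a function of the margin can be checked numerically. The following is a minimal sketch, not the code used to draw Figure 10.4. The function name `margin_losses` is hypothetical, and dividing the binomial deviance by log 2 is an assumed reading of "scaled so that it passes through the point (0, 1)".

```python
import math

def margin_losses(m):
    """Return the five losses of Figure 10.4 at margin m = y * f."""
    return {
        # Unit penalty for a negative margin, none otherwise.
        "misclassification": 1.0 if m < 0 else 0.0,
        "exponential": math.exp(-m),
        # log(1 + exp(-2m)), divided by log 2 (assumed scaling) so the value at m = 0 is 1.
        "binomial_deviance": math.log1p(math.exp(-2.0 * m)) / math.log(2.0),
        # (y - f)^2 equals (1 - yf)^2 because y^2 = 1 for y in {-1, +1}.
        "squared_error": (1.0 - m) ** 2,
        # Support vector (hinge) loss (1 - yf)_+.
        "hinge": max(0.0, 1.0 - m),
    }

for m in (-3.0, -1.0, 0.0, 1.0, 3.0):
    row = margin_losses(m)
    print(m, {name: round(value, 3) for name, value in row.items()})
```

At a margin of -3 the exponential loss is already more than twice the scaled deviance, while squared error starts to grow again once the margin exceeds 1, which is the non-monotone behaviour described above.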

With $K$-class classification, the response $Y$ takes values in the unordered
set $\mathcal{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_K\}$ (see Sections 2.4 and 4.4). We now seek a classifier
$G(x)$ taking values in $\mathcal{G}$. It is sufficient to know the class conditional probabilities
$p_k(x) = \Pr(Y = \mathcal{G}_k \mid x)$, $k = 1, 2, \ldots, K$, for then the Bayes classifier
is

$$G(x) = \mathcal{G}_k \ \text{ where } \ k = \arg\max_{\ell} p_\ell(x). \tag{10.20}$$

In principle, though, we need not learn the $p_k(x)$, but simply which one is
largest. However, in data mining applications the interest is often more in
the class probabilities $p_\ell(x)$, $\ell = 1, \ldots, K$ themselves, rather than in performing
a class assignment. As in Section 4.4, the logistic model generalizes
naturally to $K$ classes,

$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{\ell=1}^{K} e^{f_\ell(x)}}, \tag{10.21}$$

which ensures that $0 \le p_k(x) \le 1$ and that they sum to one. Note that
here we have $K$ different functions, one per class. There is a redundancy
in the functions $f_k(x)$, since adding an arbitrary $h(x)$ to each leaves the
model unchanged. Traditionally one of them is set to zero: for example,
$f_K(x) = 0$, as in (4.17). Here we prefer to retain the symmetry, and impose
the constraint $\sum_{k=1}^{K} f_k(x) = 0$. The binomial deviance extends naturally
to the $K$-class multinomial deviance loss function:

$$L(y, p(x)) = -\sum_{k=1}^{K} I(y = \mathcal{G}_k) \log p_k(x)
            = -\sum_{k=1}^{K} I(y = \mathcal{G}_k) f_k(x) + \log\left(\sum_{\ell=1}^{K} e^{f_\ell(x)}\right). \tag{10.22}$$

As in the two-class case, the criterion (10.22) penalizes incorrect predictions
only linearly in their degree of incorrectness.

Zhu et al. (2005) generalize the exponential loss for $K$-class problems.
See Exercise 10.5 for details.
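The relationship between the class functions, the probabilities (10.21) and the deviance (10.22) can be illustrated with a few lines of code. This is only a sketch under stated assumptions: classes are indexed 0, ..., K-1, and the helper names `class_probabilities`, `center` and `multinomial_deviance` are hypothetical.

```python
import math

def class_probabilities(f):
    """Equation (10.21): p_k = exp(f_k) / sum_l exp(f_l)."""
    top = max(f)                          # subtract the max for numerical stability
    e = [math.exp(v - top) for v in f]
    total = sum(e)
    return [v / total for v in e]

def center(f):
    """Impose the symmetric constraint sum_k f_k = 0."""
    mean = sum(f) / len(f)
    return [v - mean for v in f]

def multinomial_deviance(k, f):
    """Second form of (10.22) for an observation whose true class has index k."""
    top = max(f)
    log_sum = top + math.log(sum(math.exp(v - top) for v in f))
    return -f[k] + log_sum

f = [2.0, 0.5, -1.0]
p = class_probabilities(f)
print(p, sum(p))
# Adding the same constant h to every f_k leaves the probabilities unchanged.
print(class_probabilities([v + 10.0 for v in f]))
# The two forms of (10.22) agree: -log p_k equals -f_k + log sum_l exp(f_l).
print(multinomial_deviance(0, f), -math.log(p[0]))
```

Because the probabilities are invariant to a common shift, centering the functions costs nothing in fit; it only removes the redundancy noted above.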


Robust Loss Functions for Regression 

In the regression setting, analogous to the relationship between exponential
loss and binomial log-likelihood is the relationship between squared-error
loss $L(y, f(x)) = (y - f(x))^2$ and absolute loss $L(y, f(x)) = |y - f(x)|$. The
population solutions are $f(x) = E(Y \mid x)$ for squared-error loss, and $f(x) =
\mathrm{median}(Y \mid x)$ for absolute loss; for symmetric error distributions these are
the same. However, on finite samples squared-error loss places much more
emphasis on observations with large absolute residuals $|y_i - f(x_i)|$ during
the fitting process. It is thus far less robust, and its performance severely
degrades for long-tailed error distributions and especially for grossly mismeasured
$y$-values (“outliers”). Other more robust criteria, such as absolute
loss, perform much better in these situations. In the statistical robustness
literature, a variety of regression loss criteria have been proposed
that provide strong resistance (if not absolute immunity) to gross outliers
while being nearly as efficient as least squares for Gaussian errors. They
are often better than either for error distributions with moderately heavy
tails. One such criterion is the Huber loss criterion used for M-regression
(Huber, 1964)

$$L(y, f(x)) = \begin{cases}
[y - f(x)]^2 & \text{for } |y - f(x)| \le \delta, \\
2\delta\,|y - f(x)| - \delta^2 & \text{otherwise.}
\end{cases} \tag{10.23}$$

Figure 10.5 compares these three loss functions.
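The three regression losses are easy to compare on a single residual $r = y - f(x)$. The sketch below follows (10.23) directly; the function names and the choice $\delta = 1$ are illustrative assumptions.

```python
def squared_loss(r):
    return r * r

def absolute_loss(r):
    return abs(r)

def huber_loss(r, delta):
    """Equation (10.23): quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return a * a
    return 2.0 * delta * a - delta * delta

delta = 1.0  # illustrative threshold, not a recommended value
for r in (0.5, 1.0, 3.0, 10.0):
    print(r, squared_loss(r), absolute_loss(r), huber_loss(r, delta))
```

The two branches of the Huber loss meet at $|r| = \delta$ (both give $\delta^2$), and for a gross outlier such as $r = 10$ the penalty grows only linearly rather than quadratically.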

These considerations suggest that when robustness is a concern, as is
especially the case in data mining applications (see Section 10.7), squared-error
loss for regression and exponential loss for classification are not the
best criteria from a statistical perspective. However, they both lead to the
elegant modular boosting algorithms in the context of forward stagewise
additive modeling. For squared-error loss one simply fits the base learner
to the residuals from the current model $y_i - f_{m-1}(x_i)$ at each step. For
exponential loss one performs a weighted fit of the base learner to the
output values $y_i$, with weights $w_i = \exp(-y_i f_{m-1}(x_i))$. Using other more
robust criteria directly in their place does not give rise to such simple
feasible boosting algorithms. However, in Section 10.10.2 we show how one
can derive simple elegant boosting algorithms based on any differentiable
loss criterion, thereby producing highly robust boosting procedures for data
mining.

[Figure 10.5 horizontal axis: $y - f$]

FIGURE 10.5. A comparison of three loss functions for regression, plotted as a
function of the margin $y - f$. The Huber loss function combines the good properties
of squared-error loss near zero and absolute error loss when $|y - f|$ is large.


10.7 “Off-the-Shelf” Procedures for Data Mining 

Predictive learning is an important aspect of data mining. As can be seen
from this book, a wide variety of methods have been developed for predictive
learning from data. For each particular method there are situations
for which it is particularly well suited, and others where it performs badly
compared to the best that can be done with that data. We have attempted
to characterize appropriate situations in our discussions of each of the respective
methods. However, it is seldom known in advance which procedure
will perform best or even well for any given problem. Table 10.1 summarizes
some of the characteristics of a number of learning methods.

TABLE 10.1. Some characteristics of different learning methods. Key: ▲ = good,
◆ = fair, and ▼ = poor.

Characteristic                                       Neural   SVM   Trees   MARS   k-NN,
                                                     Nets                          Kernels
Natural handling of data of “mixed” type               ▼       ▼      ▲      ▲       ▼
Handling of missing values                             ▼       ▼      ▲      ▲       ▲
Robustness to outliers in input space                  ▼       ▼      ▲      ▼       ▲
Insensitive to monotone transformations of inputs      ▼       ▼      ▲      ▼       ▼
Computational scalability (large N)                    ▼       ▼      ▲      ▲       ▼
Ability to deal with irrelevant inputs                 ▼       ▼      ▲      ▲       ▼
Ability to extract linear combinations of features     ▲       ▲      ▼      ▼       ◆
Interpretability                                       ▼       ▼      ◆      ▲       ▼
Predictive power                                       ▲       ▲      ▼      ◆       ▲

Industrial and commercial data mining applications tend to be especially
challenging in terms of the requirements placed on learning procedures.
Data sets are often very large in terms of number of observations and
number of variables measured on each of them. Thus, computational considerations
play an important role. Also, the data are usually messy: the
inputs tend to be mixtures of quantitative, binary, and categorical variables,
the latter often with many levels. There are generally many missing
values, complete observations being rare. Distributions of numeric predictor
and response variables are often long-tailed and highly skewed. This
is the case for the spam data (Section 9.1.2); when fitting a generalized
additive model, we first log-transformed each of the predictors in order to
get a reasonable fit. In addition they usually contain a substantial fraction
of gross mis-measurements (outliers). The predictor variables are generally
measured on very different scales.

In data mining applications, usually only a small fraction of the large
number of predictor variables that have been included in the analysis are
actually relevant to prediction. Also, unlike many applications such as pattern
recognition, there is seldom reliable domain knowledge to help create
especially relevant features and/or filter out the irrelevant ones, the inclusion
of which dramatically degrades the performance of many methods.

In addition, data mining applications generally require interpretable models.
It is not enough to simply produce predictions. It is also desirable to
have information providing qualitative understanding of the relationship
between joint values of the input variables and the resulting predicted response
value. Thus, black box methods such as neural networks, which can
be quite useful in purely predictive settings such as pattern recognition,
are far less useful for data mining.

These requirements of speed, interpretability and the messy nature of 
the data sharply limit the usefulness of most learning procedures as off- 
the-shelf methods for data mining. An “off-the-shelf” method is one that 
can be directly applied to the data without requiring a great deal of time- 
consuming data preprocessing or careful tuning of the learning procedure. 

Of all the well-known learning methods, decision trees come closest to
meeting the requirements for serving as an off-the-shelf procedure for data
mining. They are relatively fast to construct and they produce interpretable
models (if the trees are small). As discussed in Section 9.2, they naturally
incorporate mixtures of numeric and categorical predictor variables and
missing values. They are invariant under (strictly monotone) transformations
of the individual predictors. As a result, scaling and/or more general
transformations are not an issue, and they are immune to the effects of predictor
outliers. They perform internal feature selection as an integral part
of the procedure. They are thereby resistant, if not completely immune,
to the inclusion of many irrelevant predictor variables. These properties of
decision trees are largely the reason that they have emerged as the most
popular learning method for data mining.

Trees have one aspect that prevents them from being the ideal tool for 
predictive learning, namely inaccuracy. They seldom provide predictive
accuracy comparable to the best that can be achieved with the data at hand.
As seen in Section 10.1, boosting decision trees improves their accuracy, 
often dramatically. At the same time it maintains most of their desirable 
properties for data mining. Some advantages of trees that are sacrificed by 
boosting are speed, interpretability, and, for AdaBoost, robustness against 
overlapping class distributions and especially mislabeling of the training 
data. A gradient boosted model (GBM) is a generalization of tree boosting 
that attempts to mitigate these problems, so as to produce an accurate and 
effective off-the-shelf procedure for data mining. 


10.8 Example: Spam Data 

Before we go into the details of gradient boosting, we demonstrate its
abilities on a two-class classification problem. The spam data are introduced in
Chapter 1, and used as an example for many of the procedures in Chapter 9 
(Sections 9.1.2, 9.2.5, 9.3.1 and 9.4.1). 

Applying gradient boosting to these data resulted in a test error rate of
4.5%, using the same test set as was used in Section 9.1.2. By comparison,
an additive logistic regression achieved 5.5%, a CART tree fully grown and
pruned by cross-validation 8.7%, and MARS 5.5%. The standard error of
these estimates is around 0.6%, although gradient boosting is significantly
better than all of them using the McNemar test (Exercise 10.6).

In Section 10.13 below we develop a relative importance measure for 
each predictor, as well as a partial dependence plot describing a predictor’s 
contribution to the fitted model. We now illustrate these for the spam data. 

Figure 10.6 displays the relative importance spectrum for all 57 predictor 
variables. Clearly some predictors are more important than others in separating
spam from email. The frequencies of the character strings !, $, hp,
and remove are estimated to be the four most relevant predictor variables. 
At the other end of the spectrum, the character strings 857, 415, table, and 
3d have virtually no relevance. 

The quantity being modeled here is the log-odds of spam versus email

$$f(x) = \log \frac{\Pr(\text{spam} \mid x)}{\Pr(\text{email} \mid x)} \tag{10.24}$$

(see Section 10.13 below). Figure 10.7 shows the partial dependence of the
log-odds on selected important predictors, two positively associated with
spam (! and remove), and two negatively associated (edu and hp). These
particular dependencies are seen to be essentially monotonic. There is a
general agreement with the corresponding functions found by the additive
logistic regression model; see Figure 9.1 on page 303.

Running a gradient boosted model on these data with J = 2 terminal- 
node trees produces a purely additive (main effects) model for the log- 
odds, with a corresponding error rate of 4.7%, as compared to 4.5% for the 
full gradient boosted model (with J = 5 terminal-node trees). Although 
not significant, this slightly higher error rate suggests that there may be 
interactions among some of the important predictor variables. This can 
be diagnosed through two-variable partial dependence plots. Figure 10.8 
shows one of the several such plots displaying strong interaction effects. 

One sees that for very low frequencies of hp, the log-odds of spam are 
greatly increased. For high frequencies of hp, the log-odds of spam tend to 
be much lower and roughly constant as a function of !. As the frequency 
of hp decreases, the functional relationship with ! strengthens. 


FIGURE 10.6. Predictor variable importance spectrum for the spam data. The
variable names are written on the vertical axis.

FIGURE 10.7. Partial dependence of log-odds of spam on four important
predictors. The red ticks at the base of the plots are deciles of the input variable.

FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a
function of joint frequencies of hp and the character !.


10.9 Boosting Trees

Regression and classification trees are discussed in detail in Section 9.2.
They partition the space of all joint predictor variable values into disjoint
regions $R_j$, $j = 1, 2, \ldots, J$, as represented by the terminal nodes of the tree.
A constant $\gamma_j$ is assigned to each such region and the predictive rule is

$$x \in R_j \Rightarrow f(x) = \gamma_j.$$

Thus a tree can be formally expressed as

$$T(x; \Theta) = \sum_{j=1}^{J} \gamma_j I(x \in R_j), \tag{10.25}$$

with parameters $\Theta = \{R_j, \gamma_j\}_1^J$. $J$ is usually treated as a meta-parameter.
The parameters are found by minimizing the empirical risk

$$\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j). \tag{10.26}$$

This is a formidable combinatorial optimization problem, and we usually
settle for approximate suboptimal solutions. It is useful to divide the optimization
problem into two parts:

Finding $\gamma_j$ given $R_j$: Given the $R_j$, estimating the $\gamma_j$ is typically trivial,
and often $\hat{\gamma}_j = \bar{y}_j$, the mean of the $y_i$ falling in region $R_j$. For misclassification
loss, $\hat{\gamma}_j$ is the modal class of the observations falling in
region $R_j$.

Finding $R_j$: This is the difficult part, for which approximate solutions are
found. Note also that finding the $R_j$ entails estimating the $\gamma_j$ as well.
A typical strategy is to use a greedy, top-down recursive partitioning
algorithm to find the $R_j$. In addition, it is sometimes necessary to
approximate (10.26) by a smoother and more convenient criterion for
optimizing the $R_j$:

$$\tilde{\Theta} = \arg\min_{\Theta} \sum_{i=1}^{N} \tilde{L}(y_i, T(x_i, \Theta)). \tag{10.27}$$

Then given the $\hat{R}_j = \tilde{R}_j$, the $\gamma_j$ can be estimated more precisely
using the original criterion.

In Section 9.2 we described such a strategy for classification trees. The Gini
index replaced misclassification loss in the growing of the tree (identifying
the $R_j$).
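The easy half of the problem, estimating the constants once the regions are fixed, can be sketched in a few lines. The example assumes a single numeric predictor whose regions are intervals separated by given thresholds; `region_index`, `fit_constants` and `tree_predict` are hypothetical helper names, and every region is assumed to contain at least one training point.

```python
from collections import Counter
from statistics import mean

def region_index(x, thresholds):
    """Index j of the interval R_j containing the scalar x (thresholds sorted)."""
    return sum(x > t for t in thresholds)

def fit_constants(xs, ys, thresholds, loss="squared"):
    """Estimate gamma_j for fixed regions: mean for squared error, mode otherwise."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(region_index(x, thresholds), []).append(y)
    if loss == "squared":
        return {j: mean(values) for j, values in groups.items()}
    # Misclassification loss: the modal class in each region.
    return {j: Counter(values).most_common(1)[0][0] for j, values in groups.items()}

def tree_predict(x, thresholds, gammas):
    """Evaluate T(x; Theta) = sum_j gamma_j I(x in R_j) as in (10.25)."""
    return gammas[region_index(x, thresholds)]

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
gammas = fit_constants(xs, ys, thresholds=[3.5])
print(gammas, tree_predict(2.5, [3.5], gammas))

labels = ["a", "a", "b", "b", "b", "a"]
print(fit_constants(xs, labels, thresholds=[3.5], loss="misclassification"))
```

Finding good thresholds is the hard part that the greedy recursive partitioning addresses; here they are simply supplied.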

The boosted tree model is a sum of such trees,

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m), \tag{10.28}$$

induced in a forward stagewise manner (Algorithm 10.2). At each step in
the forward stagewise procedure one must solve

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)) \tag{10.29}$$

for the region set and constants $\Theta_m = \{R_{jm}, \gamma_{jm}\}_1^{J_m}$ of the next tree, given
the current model $f_{m-1}(x)$.

Given the regions $R_{jm}$, finding the optimal constants $\gamma_{jm}$ in each region
is typically straightforward:

$$\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma_{jm}). \tag{10.30}$$

Finding the regions is difficult, and even more difficult than for a single
tree. For a few special cases, the problem simplifies.

For squared-error loss, the solution to (10.29) is no harder than for a
single tree. It is simply the regression tree that best predicts the current
residuals $y_i - f_{m-1}(x_i)$, and $\hat{\gamma}_{jm}$ is the mean of these residuals in each
corresponding region.

For two-class classification and exponential loss, this stagewise approach
gives rise to the AdaBoost method for boosting classification trees (Algorithm
10.1). In particular, if the trees $T(x; \Theta_m)$ are restricted to be scaled
classification trees, then we showed in Section 10.4 that the solution to
(10.29) is the tree that minimizes the weighted error rate
$\sum_{i=1}^{N} w_i^{(m)} I(y_i \neq T(x_i; \Theta_m))$ with weights $w_i^{(m)} = e^{-y_i f_{m-1}(x_i)}$. By a scaled classification
tree, we mean $\beta_m T(x; \Theta_m)$, with the restriction that $\gamma_{jm} \in \{-1, 1\}$.

Without this restriction, (10.29) still simplifies for exponential loss to a
weighted exponential criterion for the new tree:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} w_i^{(m)} \exp[-y_i T(x_i; \Theta_m)]. \tag{10.31}$$

It is straightforward to implement a greedy recursive-partitioning algorithm
using this weighted exponential loss as a splitting criterion. Given the $R_{jm}$,
one can show (Exercise 10.7) that the solution to (10.30) is the weighted
log-odds in each corresponding region

$$\hat{\gamma}_{jm} = \frac{1}{2} \log \frac{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = 1)}{\sum_{x_i \in R_{jm}} w_i^{(m)} I(y_i = -1)}. \tag{10.32}$$

This requires a specialized tree-growing algorithm; in practice, we prefer
the approximation presented below that uses a weighted least squares regression
tree.
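The region constant under exponential loss can be checked numerically. Within one region the criterion is $W_+ e^{-\gamma} + W_- e^{\gamma}$, where $W_+$ and $W_-$ are the total weights of the two classes; setting its derivative to zero gives $\gamma = \frac{1}{2}\log(W_+/W_-)$. The sketch below verifies this on made-up labels and weights; the function names are hypothetical, and both classes are assumed to be present in the region.

```python
import math

def weighted_exp_loss(gamma, y, w):
    """Sum over a region of w_i * exp(-y_i * gamma), with y_i in {-1, +1}."""
    return sum(wi * math.exp(-yi * gamma) for yi, wi in zip(y, w))

def half_weighted_log_odds(y, w):
    """Closed-form minimizer of weighted_exp_loss (needs both classes present)."""
    w_pos = sum(wi for yi, wi in zip(y, w) if yi == 1)
    w_neg = sum(wi for yi, wi in zip(y, w) if yi == -1)
    return 0.5 * math.log(w_pos / w_neg)

# Illustrative labels and weights for the observations in one region.
y = [1, 1, -1, 1, -1]
w = [0.5, 1.0, 2.0, 1.5, 0.25]

gamma = half_weighted_log_odds(y, w)
print(gamma, weighted_exp_loss(gamma, y, w))
# Nearby values of gamma give a larger loss, as expected at a minimum.
print(weighted_exp_loss(gamma - 0.01, y, w), weighted_exp_loss(gamma + 0.01, y, w))
```

At the minimizer the loss equals $2\sqrt{W_+ W_-}$, which gives a second check on the closed form.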

Using loss criteria such as the absolute error or the Huber loss (10.23) in 
place of squared-error loss for regression, and the deviance (10.22) in place 
of exponential loss for classification, will serve to robustify boosting trees. 
Unfortunately, unlike their nonrobust counterparts, these robust criteria 
do not give rise to simple fast boosting algorithms. 

For more general loss criteria the solution to (10.30), given the $R_{jm}$,
is typically straightforward since it is a simple “location” estimate. For
absolute loss it is just the median of the residuals in each respective region.
For the other criteria fast iterative algorithms exist for solving (10.30),
and usually their faster “single-step” approximations are adequate. The
problem is tree induction. Simple fast algorithms do not exist for solving
(10.29) for these more general loss criteria, and approximations like (10.27)
become essential.


10.10 Numerical Optimization via Gradient Boosting

Fast approximate algorithms for solving (10.29) with any differentiable loss
criterion can be derived by analogy to numerical optimization. The loss in
using $f(x)$ to predict $y$ on the training data is

$$L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)). \tag{10.33}$$

The goal is to minimize $L(f)$ with respect to $f$, where here $f(x)$ is constrained
to be a sum of trees (10.28). Ignoring this constraint, minimizing
(10.33) can be viewed as a numerical optimization

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}), \tag{10.34}$$

where the “parameters” $\mathbf{f} \in \mathbb{R}^N$ are the values of the approximating function
$f(x_i)$ at each of the $N$ data points $x_i$:

$$\mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}^T.$$

Numerical optimization procedures solve (10.34) as a sum of component
vectors

$$\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N,$$

where $\mathbf{f}_0 = \mathbf{h}_0$ is an initial guess, and each successive $\mathbf{f}_m$ is induced based
on the current parameter vector $\mathbf{f}_{m-1}$, which is the sum of the previously
induced updates. Numerical optimization methods differ in their prescriptions
for computing each increment vector $\mathbf{h}_m$ (“step”).


10.10.1 Steepest Descent 

Steepest descent chooses $\mathbf{h}_m = -\rho_m \mathbf{g}_m$ where $\rho_m$ is a scalar and $\mathbf{g}_m \in \mathbb{R}^N$
is the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$. The components of the
gradient $\mathbf{g}_m$ are

$$g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)}. \tag{10.35}$$

The step length $\rho_m$ is the solution to

$$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\,\mathbf{g}_m). \tag{10.36}$$

The current solution is then updated

$$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$$

and the process repeated at the next iteration. Steepest descent can be
viewed as a very greedy strategy, since $-\mathbf{g}_m$ is the local direction in $\mathbb{R}^N$
for which $L(\mathbf{f})$ is most rapidly decreasing at $\mathbf{f} = \mathbf{f}_{m-1}$.
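A single steepest-descent step in the space of fitted values can be sketched as follows. The example assumes squared-error loss and uses a simple ternary search for the line search (10.36), which is valid here because the loss is convex in $\rho$; the function names and search interval are illustrative assumptions.

```python
def squared_error_loss(y, f):
    """L(f) = sum_i (y_i - f_i)^2 over the training points."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))

def squared_error_gradient(y, f):
    """Components g_i = dL/df_i = -2 (y_i - f_i), as in (10.35)."""
    return [-2.0 * (yi - fi) for yi, fi in zip(y, f)]

def line_search(y, f, g, lo=0.0, hi=10.0, iterations=200):
    """Ternary search for the rho minimizing L(f - rho * g), as in (10.36)."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        loss_a = squared_error_loss(y, [fi - a * gi for fi, gi in zip(f, g)])
        loss_b = squared_error_loss(y, [fi - b * gi for fi, gi in zip(f, g)])
        if loss_a < loss_b:
            hi = b      # the minimum lies to the left of b
        else:
            lo = a      # the minimum lies to the right of a
    return (lo + hi) / 2.0

def steepest_descent_step(y, f):
    """One update f_m = f_{m-1} - rho_m * g_m."""
    g = squared_error_gradient(y, f)
    rho = line_search(y, f, g)
    return [fi - rho * gi for fi, gi in zip(f, g)], rho

y = [1.0, 3.0, 5.0]
f0 = [3.0, 3.0, 3.0]          # initial guess: a constant fit
f1, rho = steepest_descent_step(y, f0)
print(rho, f1)
```

For squared error the step length comes out at one half and a single step reproduces the training responses exactly. The update says nothing about inputs outside the training set, which is why an unconstrained gradient step cannot by itself serve as a predictive model.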

10.10.2 Gradient Boosting 

Forward stagewise boosting (Algorithm 10.2) is also a very greedy strategy.
At each step the solution tree is the one that maximally reduces (10.29),
given the current model $f_{m-1}$ and its fits $f_{m-1}(x_i)$. Thus, the tree predictions
$T(x_i; \Theta_m)$ are analogous to the components of the negative gradient
(10.35). The principal difference between them is that the tree components
$\mathbf{t}_m = \{T(x_1; \Theta_m), \ldots, T(x_N; \Theta_m)\}^T$ are not independent. They are constrained
to be the predictions of a $J_m$-terminal node decision tree, whereas
the negative gradient is the unconstrained maximal descent direction.

The solution to (10.30) in the stagewise approach is analogous to the line
search (10.36) in steepest descent. The difference is that (10.30) performs
a separate line search for those components of $\mathbf{t}_m$ that correspond to each
separate terminal region $\{T(x_i; \Theta_m)\}_{x_i \in R_{jm}}$.

If minimizing loss on the training data (10.33) were the only goal, steepest
descent would be the preferred strategy. The gradient (10.35) is trivial
to calculate for any differentiable loss function $L(y, f(x))$, whereas solving
(10.29) is difficult for the robust criteria discussed in Section 10.6. Unfortunately
the gradient (10.35) is defined only at the training data points $x_i$,
whereas the ultimate goal is to generalize $f_M(x)$ to new data not represented
in the training set.

A possible resolution to this dilemma is to induce a tree $T(x; \Theta_m)$ at the
$m$th iteration whose predictions $\mathbf{t}_m$ are as close as possible to the negative
gradient. Using squared error to measure closeness, this leads us to

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} (-g_{im} - T(x_i; \Theta))^2. \tag{10.37}$$

That is, one fits the tree $T$ to the negative gradient values (10.35) by least
squares. As noted in Section 10.9 fast algorithms exist for least squares
decision tree induction. Although the solution regions $\tilde{R}_{jm}$ to (10.37) will
not be identical to the regions $R_{jm}$ that solve (10.29), it is generally similar
enough to serve the same purpose. In any case, the forward stagewise
boosting procedure, and top-down decision tree induction, are themselves
approximation procedures. After constructing the tree (10.37), the corresponding
constants in each region are given by (10.30).

TABLE 10.2. Gradients for commonly used loss functions.

Setting          Loss Function                   $-\partial L(y_i, f(x_i)) / \partial f(x_i)$
Regression       $\frac{1}{2}[y_i - f(x_i)]^2$   $y_i - f(x_i)$
Regression       $|y_i - f(x_i)|$                $\mathrm{sign}[y_i - f(x_i)]$
Regression       Huber                           $y_i - f(x_i)$ for $|y_i - f(x_i)| \le \delta_m$;
                                                 $\delta_m\,\mathrm{sign}[y_i - f(x_i)]$ for $|y_i - f(x_i)| > \delta_m$,
                                                 where $\delta_m = \alpha$th-quantile$\{|y_i - f(x_i)|\}$
Classification   Deviance                        $k$th component: $I(y_i = \mathcal{G}_k) - p_k(x_i)$

Table 10.2 summarizes the gradients for commonly used loss functions.
For squared error loss, the negative gradient is just the ordinary residual
$-g_{im} = y_i - f_{m-1}(x_i)$, so that (10.37) on its own is equivalent to standard
least-squares boosting. With absolute error loss, the negative gradient is
the sign of the residual, so at each iteration (10.37) fits the tree to the
sign of the current residuals by least squares. For Huber M-regression, the
negative gradient is a compromise between these two (see the table).
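The regression rows of Table 10.2 translate directly into code. The sketch below is illustrative only: `negative_gradient` and `quantile` are hypothetical helpers, and the quantile uses a plain order-statistic convention without interpolation, which is one of several reasonable choices for $\delta_m$.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def quantile(values, alpha):
    """Order-statistic quantile without interpolation (an assumed convention)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(alpha * len(ordered)) - 1))
    return ordered[index]

def negative_gradient(y, f, loss, alpha=0.9):
    """Negative gradients (pseudo residuals) for the regression rows of Table 10.2."""
    r = [yi - fi for yi, fi in zip(y, f)]
    if loss == "squared":        # L = (1/2)(y - f)^2: the ordinary residual
        return r
    if loss == "absolute":       # L = |y - f|: the sign of the residual
        return [float(sign(ri)) for ri in r]
    if loss == "huber":          # residual, clipped at delta_m in absolute value
        delta = quantile([abs(ri) for ri in r], alpha)
        return [ri if abs(ri) <= delta else delta * sign(ri) for ri in r]
    raise ValueError("unknown loss: " + loss)

y = [1.0, 2.0, 3.0, 40.0]        # the last response is a gross outlier
f = [2.0, 2.0, 2.0, 2.0]         # current fit
for loss in ("squared", "absolute", "huber"):
    print(loss, negative_gradient(y, f, loss, alpha=0.75))
```

With squared error the outlier contributes a target of 38 to the next tree, whereas the absolute and Huber criteria cap its contribution at 1 in this example.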

For classification the loss function is the multinomial deviance (10.22),
and $K$ least squares trees are constructed at each iteration. Each tree $T_{km}$
is fit to its respective negative gradient vector $-\mathbf{g}_{km}$,

$$-g_{ikm} = -\left[\frac{\partial L(y_i, f_1(x_i), \ldots, f_K(x_i))}{\partial f_k(x_i)}\right]_{\mathbf{f}(x_i) = \mathbf{f}_{m-1}(x_i)}
          = I(y_i = \mathcal{G}_k) - p_k(x_i), \tag{10.38}$$

with $p_k(x)$ given by (10.21). Although $K$ separate trees are built at each
iteration, they are related through (10.21). For binary classification ($K = 2$),
only one tree is needed (Exercise 10.10).
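The form of the negative gradient follows in one line from the deviance (10.22) and the probabilities (10.21). The short derivation below is a sketch of that step, written for a single observation $(x, y)$.

```latex
\begin{align*}
L\bigl(y, f_1(x), \ldots, f_K(x)\bigr)
  &= -\sum_{\ell=1}^{K} I(y = \mathcal{G}_\ell)\, f_\ell(x)
     + \log\Bigl(\sum_{\ell=1}^{K} e^{f_\ell(x)}\Bigr), \\
\frac{\partial L}{\partial f_k(x)}
  &= -I(y = \mathcal{G}_k) + \frac{e^{f_k(x)}}{\sum_{\ell=1}^{K} e^{f_\ell(x)}}
   = -I(y = \mathcal{G}_k) + p_k(x), \\
-\frac{\partial L}{\partial f_k(x)}
  &= I(y = \mathcal{G}_k) - p_k(x).
\end{align*}
```

Summing the last line over $k$ gives $1 - 1 = 0$, so the $K$ negative gradient components of each observation sum to zero, in keeping with the symmetric constraint on the $f_k$.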


10.10.3 Implementations of Gradient Boosting 

Algorithm 10.3 presents the generic gradient tree-boosting algorithm for
regression. Specific algorithms are obtained by inserting different loss criteria
$L(y, f(x))$. The first line of the algorithm initializes to the optimal
constant model, which is just a single terminal node tree. The components
of the negative gradient computed at line 2(a) are referred to as generalized
or pseudo residuals, $r$. Gradients for commonly used loss functions are
summarized in Table 10.2.
Algorithm 10.3 Gradient Tree Boosting Algorithm.

1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.

2. For $m = 1$ to $M$:

   (a) For $i = 1, 2, \ldots, N$ compute

       $$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}.$$

   (b) Fit a regression tree to the targets $r_{im}$ giving terminal regions
       $R_{jm}$, $j = 1, 2, \ldots, J_m$.

   (c) For $j = 1, 2, \ldots, J_m$ compute

       $$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma).$$

   (d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

3. Output $\hat{f}(x) = f_M(x)$.
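The steps of Algorithm 10.3 can be made concrete with a deliberately small sketch. It assumes absolute loss, a single numeric predictor with at least two distinct values, and two-terminal-node trees (stumps); all function names are hypothetical, and this is not the MART or gbm implementation. With absolute loss, line 1 is the median of the responses, line 2(a) gives the sign of the residual, and line 2(c) is the median of the current residuals in each region.

```python
from statistics import median

def sign(v):
    return (v > 0) - (v < 0)

def sse(values):
    """Sum of squared deviations from the mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)

def fit_stump(x, r):
    """Line 2(b): least-squares split of a scalar input for targets r."""
    distinct = sorted(set(x))
    best_split, best_sse = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        s = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_split, best_sse = s, total
    return best_split

def gradient_boost_absolute_loss(x, y, M):
    """Gradient tree boosting with stumps and absolute loss; returns a predictor."""
    f0 = median(y)                                   # line 1
    f = [f0] * len(y)
    stumps = []
    for _ in range(M):                               # line 2
        r = [sign(yi - fi) for yi, fi in zip(y, f)]  # 2(a): pseudo residuals
        s = fit_stump(x, r)                          # 2(b): regions x <= s and x > s
        # 2(c): region constants are medians of the current residuals.
        g_left = median([yi - fi for xi, yi, fi in zip(x, y, f) if xi <= s])
        g_right = median([yi - fi for xi, yi, fi in zip(x, y, f) if xi > s])
        stumps.append((s, g_left, g_right))
        # 2(d): update the fitted values.
        f = [fi + (g_left if xi <= s else g_right) for xi, fi in zip(x, f)]

    def predict(x_new):                              # line 3
        return f0 + sum(gl if x_new <= s else gr for s, gl, gr in stumps)
    return predict

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 5, 5, 5, 50]                        # the last response is an outlier
predict = gradient_boost_absolute_loss(x, y, M=1)
print([predict(v) for v in x])
```

After one iteration the fit is 1 on the left half and 5 on the right half: the outlying response of 50 does not drag the right-hand constant upward, because the constant is a median rather than a mean.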


The algorithm for classification is similar. Lines 2(a)-(d) are repeated
$K$ times at each iteration $m$, once for each class using (10.38). The result
at line 3 is $K$ different (coupled) tree expansions $f_{kM}(x)$, $k = 1, 2, \ldots, K$.
These produce probabilities via (10.21) or do classification as in (10.20).
Details are given in Exercise 10.9. Two basic tuning parameters are the
number of iterations $M$ and the sizes of each of the constituent trees
$J_m$, $m = 1, 2, \ldots, M$.

The original implementation of this algorithm was called MART for
“multiple additive regression trees,” and was referred to in the first edition
of this book. Many of the figures in this chapter were produced by
MART. Gradient boosting as described here is implemented in the R gbm
package (Ridgeway, 1999, “Gradient Boosted Models”), and is freely available.
The gbm package is used in Section 10.14.2, and extensively in Chapters
15 and 16. Another R implementation of boosting is mboost (Hothorn
and Bühlmann, 2006). A commercial implementation of gradient boosting/MART
called TreeNet® is available from Salford Systems, Inc.


10.11 Right-Sized Trees for Boosting 

Historically, boosting was considered to be a technique for combining models,
here trees. As such, the tree building algorithm was regarded as a
primitive that produced models to be combined by the boosting procedure.
In this scenario, the optimal size of each tree is estimated separately
in the usual manner when it is built (Section 9.2). A very large (oversized)
tree is first induced, and then a bottom-up procedure is employed to prune
it to the estimated optimal number of terminal nodes. This approach assumes
implicitly that each tree is the last one in the expansion (10.28).
Except perhaps for the very last tree, this is clearly a very poor assumption.
The result is that trees tend to be much too large, especially during
the early iterations. This substantially degrades performance and increases
computation.

The simplest strategy for avoiding this problem is to restrict all trees
to be the same size, $J_m = J\ \forall m$. At each iteration a $J$-terminal node
regression tree is induced. Thus $J$ becomes a meta-parameter of the entire
boosting procedure, to be adjusted to maximize estimated performance for
the data at hand.

One can get an idea of useful values for $J$ by considering the properties
of the “target” function

$$\eta = \arg\min_{f} E_{XY} L(Y, f(X)). \tag{10.39}$$

Here the expected value is over the population joint distribution of $(X, Y)$.
The target function $\eta(x)$ is the one with minimum prediction risk on future
data. This is the function we are trying to approximate.

One relevant property of $\eta(X)$ is the degree to which the coordinate variables
$X^T = (X_1, X_2, \ldots, X_p)$ interact with one another. This is captured
by its ANOVA (analysis of variance) expansion

$$\eta(X) = \sum_{j} \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots. \tag{10.40}$$

The first sum in (10.40) is over functions of only a single predictor variable
$X_j$. The particular functions $\eta_j(X_j)$ are those that jointly best approximate
$\eta(X)$ under the loss criterion being used. Each such $\eta_j(X_j)$ is called the
“main effect” of $X_j$. The second sum is over those two-variable functions
that when added to the main effects best fit $\eta(X)$. These are called the
second-order interactions of each respective variable pair $(X_j, X_k)$. The
third sum represents third-order interactions, and so on. For many problems
encountered in practice, low-order interaction effects tend to dominate.
When this is the case, models that produce strong higher-order interaction
effects, such as large decision trees, suffer in accuracy.

The interaction level of tree-based approximations is limited by the tree 
size J. Namely, no interaction effects of level greater than J - 1 are possible.
Since boosted models are additive in the trees (10.28), this limit
extends to them as well. Setting J = 2 (single split “decision stump”)
produces boosted models with only main effects; no interactions are permitted.
With J = 3, two-variable interaction effects are also allowed, and
so on. This suggests that the value chosen for J should reflect the level
of dominant interactions of $\eta(x)$. This is of course generally unknown, but
in most situations it will tend to be low. Figure 10.9 illustrates the effect
of interaction order (choice of J) on the simulation example (10.2). The
generative function is additive (sum of quadratic monomials), so boosting
models with J > 2 incurs unnecessary variance and hence the higher test
error. Figure 10.10 compares the coordinate functions found by boosted
stumps with the true functions.

FIGURE 10.9. Boosting with different sized trees, applied to the example (10.2)
used in Figure 10.2. Since the generative model is additive, stumps perform the
best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3;
shown for comparison is the AdaBoost Algorithm 10.1. (Horizontal axis: Number of Terms.)
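
The link between tree size and interaction order can be checked numerically. The sketch below is only an illustration under assumed values: the stumps, thresholds and leaf values are made up, and the helper names are hypothetical. It uses the fact that a function has no interaction between two inputs exactly when its second-order difference over every 2 x 2 grid of values is zero.

```python
# Illustrative only: all thresholds and leaf values below are made up.

def make_stump(var, threshold, left_value, right_value):
    """Return a J = 2 tree: a single split on input `var` at `threshold`."""
    def stump(x):
        return left_value if x[var] <= threshold else right_value
    return stump

def boosted_model(trees):
    """Sum of trees, i.e. an additive tree expansion."""
    def f(x):
        return sum(tree(x) for tree in trees)
    return f

def interaction_contrast(f, a, b, a2, b2):
    """Second-order difference of f over the grid {a, a2} x {b, b2}.

    It vanishes for every grid exactly when f has no x0-x1 interaction.
    """
    return f((a, b)) - f((a, b2)) - f((a2, b)) + f((a2, b2))

# Three stumps, each splitting on a single variable.
stumps = [
    make_stump(0, 0.5, -1.0, 2.0),
    make_stump(1, -0.3, 0.75, -0.5),
    make_stump(0, 1.5, 0.25, 1.0),
]
f_stumps = boosted_model(stumps)

def three_node_tree(x):
    """A J = 3 tree: splits on x0, then on x1 inside the right branch."""
    if x[0] <= 0.5:
        return -1.0
    return 1.0 if x[1] <= 0.0 else 3.0

f_mixed = boosted_model(stumps + [three_node_tree])

contrast_stumps = interaction_contrast(f_stumps, 0.0, -1.0, 1.0, 1.0)
contrast_mixed = interaction_contrast(f_mixed, 0.0, -1.0, 1.0, 1.0)
print(contrast_stumps, contrast_mixed)
```

The stump-only model gives a zero contrast, while adding a single three-terminal-node tree that splits on both variables introduces a two-variable interaction.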

Although in many applications J = 2 will be insufficient, it is unlikely 
that J > 10 will be required. Experience so far indicates that 4 ≤ J ≤ 8
works well in the context of boosting, with results being fairly insensitive
to particular choices in this range. One can fine-tune the value for J by
trying several different values and choosing the one that produces the lowest
risk on a validation sample. However, this seldom provides significant
improvement over using J ≈ 6.

FIGURE 10.10. Coordinate functions estimated by boosting stumps for the simulated
example used in Figure 10.9. The true quadratic functions are shown for
comparison. (Panel title: Coordinate Functions for Additive Logistic Trees.)


10.12 Regularization 

Besides the size of the constituent trees, J, the other meta-parameter of 
gradient boosting is the number of boosting iterations M. Each iteration 
usually reduces the training risk so that for M large enough this risk
can be made arbitrarily small. However, fitting the training data too well
can lead to overfitting, which degrades the risk on future predictions. Thus, 
there is an optimal number M* minimizing future risk that is application 
dependent. A convenient way to estimate M* is to monitor prediction risk 
as a function of M on a validation sample. The value of M that minimizes 
this risk is taken to be an estimate of M*. This is analogous to the early 
stopping strategy often used with neural networks (Section 11.4). 
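
As a minimal sketch of this selection rule, suppose the validation risk has already been recorded after each boosting iteration. The function name and the risk values below are hypothetical and serve only to show the bookkeeping.

```python
def select_num_iterations(validation_risk):
    """Return (M_hat, risk at M_hat) for a recorded validation-risk curve.

    validation_risk[m - 1] is the risk on the validation sample after
    m boosting iterations; ties are resolved in favor of the smaller M.
    """
    best_index = min(range(len(validation_risk)),
                     key=validation_risk.__getitem__)
    return best_index + 1, validation_risk[best_index]

# Made-up curve: the risk falls, flattens, then rises as overfitting sets in.
risk_curve = [0.90, 0.61, 0.48, 0.42, 0.40, 0.41, 0.43, 0.47]
m_hat, best_risk = select_num_iterations(risk_curve)
print(m_hat, best_risk)
```

In practice the curve is often flat near its minimum, so the exact value returned matters less than avoiding values of M that are far too small or far too large.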


10.12.1 Shrinkage 

Controlling the value of M is not the only possible regularization strategy. 
As with ridge regression and neural networks, shrinkage techniques can be 
employed as well (see Sections 3.4.1 and 11.5). The simplest implementation 
of shrinkage in the context of boosting is to scale the contribution of each 
tree by a factor $0 < \nu < 1$ when it is added to the current approximation.
That is, line 2(d) of Algorithm 10.3 is replaced by

$$f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}). \qquad (10.41)$$

The parameter $\nu$ can be regarded as controlling the learning rate of the
boosting procedure. Smaller values of $\nu$ (more shrinkage) result in larger
training risk for the same number of iterations M. Thus, both $\nu$ and M
control prediction risk on the training data. However, these parameters do
not operate independently. Smaller values of $\nu$ lead to larger values of M
for the same training risk, so that there is a tradeoff between them.

Empirically it has been found (Friedman, 2001) that smaller values of $\nu$
favor better test error, and require correspondingly larger values of M. In
fact, the best strategy appears to be to set $\nu$ to be very small ($\nu < 0.1$)
and then choose M by early stopping. This yields dramatic improvements
(over no shrinkage $\nu = 1$) for regression and for probability estimation. The
corresponding improvements in misclassification risk via (10.20) are less,
but still substantial. The price paid for these improvements is computational:
smaller values of $\nu$ give rise to larger values of M, and computation
is proportional to the latter. However, as seen below, many iterations are
generally computationally feasible even on very large data sets. This is 
partly due to the fact that small trees are induced at each step with no 
pruning. 
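
The interplay between the shrinkage factor and the number of iterations can be seen in a small experiment. The following is a sketch under stated assumptions, not the MART implementation: it uses squared-error loss, stumps on a single input, and simulated data (a noisy sine curve) chosen purely for illustration. All function names are hypothetical.

```python
import math
import random

def fit_stump(x, residuals):
    """Least-squares stump on a 1-D input: (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(x)
    total = sum(residuals)
    best = None
    left_sum = 0.0
    for pos in range(1, n):
        left_sum += residuals[order[pos - 1]]
        lo, hi = x[order[pos - 1]], x[order[pos]]
        if lo == hi:
            continue  # cannot split between tied inputs
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / pos + right_sum ** 2 / (n - pos)
        if best is None or score > best[0]:
            best = (score, (lo + hi) / 2.0, left_sum / pos, right_sum / (n - pos))
    return best[1], best[2], best[3]

def boost(x, y, n_iter, nu):
    """Boost stumps on squared error; each tree is scaled by nu as in (10.41)."""
    f0 = sum(y) / len(y)
    fitted = [f0] * len(y)
    stumps = []
    for _ in range(n_iter):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        threshold, left, right = fit_stump(x, residuals)
        stumps.append((threshold, left, right))
        fitted = [fi + nu * (left if xi <= threshold else right)
                  for xi, fi in zip(x, fitted)]
    return f0, stumps

def predict(model, nu, xi, n_trees):
    """Prediction using only the first n_trees stumps of the fitted sequence."""
    f0, stumps = model
    return f0 + nu * sum(left if xi <= t else right
                         for t, left, right in stumps[:n_trees])

def mean_squared_error(model, nu, x, y, n_trees):
    return sum((yi - predict(model, nu, xi, n_trees)) ** 2
               for xi, yi in zip(x, y)) / len(y)

# Simulated training data (an assumption made for this illustration).
rng = random.Random(0)
x_train = [rng.uniform(-3.0, 3.0) for _ in range(200)]
y_train = [math.sin(xi) + rng.gauss(0.0, 0.3) for xi in x_train]

n_iter = 50
risk = {}
for nu in (1.0, 0.1):
    model = boost(x_train, y_train, n_iter, nu)
    # Training risk as a function of the number of iterations M.
    risk[nu] = [mean_squared_error(model, nu, x_train, y_train, m)
                for m in range(1, n_iter + 1)]
```

Each entry of `risk` traces training risk against M for one value of the shrinkage factor. Replacing the training sample by a held-out sample in `mean_squared_error` gives the curve one would monitor for early stopping.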

Figure 10.11 shows test error curves for the simulated example (10.2) of 
Figure 10.2. A gradient boosted model (MART) was trained using binomial 
deviance, using either stumps or six terminal-node trees, and with or without
shrinkage. The benefits of shrinkage are evident, especially when the
binomial deviance is tracked. With shrinkage, each test error curve reaches 
a lower value, and stays there for many iterations. 

Section 16.2.1 draws a connection between forward stagewise shrinkage 
in boosting and the use of an $L_1$ penalty for regularizing model parameters
(the “lasso”). We argue that $L_1$ penalties may be superior to the $L_2$
penalties used by methods such as the support vector machine.


10.12.2 Subsampling 

We saw in Section 8.7 that bootstrap averaging (bagging) improves the 
performance of a noisy classifier through averaging. Chapter 15 discusses 
in some detail the variance-reduction mechanism of this sampling followed 
by averaging. We can exploit the same device in gradient boosting, both 
to improve performance and computational efficiency. 

With stochastic gradient boosting (Friedman, 1999), at each iteration we
sample a fraction $\eta$ of the training observations (without replacement),
and grow the next tree using that subsample. The rest of the algorithm is
identical. A typical value for $\eta$ can be 1/2, although for large N, $\eta$ can be
substantially smaller than 1/2.

Not only does the sampling reduce the computing time by the same
fraction $\eta$, but in many cases it actually produces a more accurate model.
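
The sampling step itself is simple. A minimal sketch follows, with a hypothetical helper name and made-up sizes: at each iteration a fresh set of distinct row indices is drawn, and only those rows would be passed to the tree-growing routine.

```python
import random

def subsample_indices(n, fraction, rng):
    """Draw int(fraction * n) distinct row indices (without replacement)."""
    size = max(1, int(fraction * n))
    return rng.sample(range(n), size)

rng = random.Random(42)   # fixed seed so the illustration is reproducible
n_obs = 1000              # assumed training-set size
fraction = 0.5            # the subsampling fraction

# One independent subsample per boosting iteration (three shown here).
per_iteration = [subsample_indices(n_obs, fraction, rng) for _ in range(3)]
print([len(s) for s in per_iteration])
```

Because each tree sees only a fraction of the rows, the cost of growing it shrinks by roughly that fraction; the remaining steps of the boosting loop are unchanged.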

Figure 10.12 illustrates the effect of subsampling using the simulated 
example (10.2), both as a classification and as a regression example. We 
see in both cases that sampling along with shrinkage slightly outperformed 
the rest. It appears here that subsampling without shrinkage does poorly. 


FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9,
using gradient boosting (MART). The models were trained using binomial deviance,
either stumps or six terminal-node trees, and with or without shrinkage.
The left panels report test deviance, while the right panels show misclassification
error. The beneficial effect of shrinkage can be seen in all cases, especially for
deviance in the left panels.

FIGURE 10.12. Test-error curves for the simulated example (10.2), showing
the effect of stochasticity. For the curves labeled “Sample= 0.5”, a different 50%
subsample of the training data was used each time a tree was grown. In the left
panel the models were fit by gbm using a binomial deviance loss function; in the
right-hand panel using square-error loss. (Panel title: 4-Node Trees; axes: Deviance, Absolute Error, Boosting Iterations.)


The downside is that we now have four parameters to set: J, M, $\nu$ and
$\eta$. Typically some early explorations determine suitable values for J, $\nu$ and
$\eta$, leaving M as the primary parameter.


10.13 Interpretation 

Single decision trees are highly interpretable. The entire model can be completely
represented by a simple two-dimensional graphic (binary tree) that
is easily visualized. Linear combinations of trees (10.28) lose this important 
feature, and must therefore be interpreted in a different way. 


10.13.1 Relative Importance of Predictor Variables 

In data mining applications the input predictor variables are seldom equally 
relevant. Often only a few of them have substantial influence on the response;
the vast majority are irrelevant and could just as well have not
been included. It is often useful to learn the relative importance or contribution
of each input variable in predicting the response.

For a single decision tree T, Breiman et al. (1984) proposed

$$\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I(v(t) = \ell) \qquad (10.42)$$

as a measure of relevance for each predictor variable $X_\ell$. The sum is over
the J - 1 internal nodes of the tree. At each such node t, one of the input
variables $X_{v(t)}$ is used to partition the region associated with that node into
two subregions; within each a separate constant is fit to the response values.
The particular variable chosen is the one that gives maximal estimated
improvement $\hat{\imath}_t^2$ in squared error risk over that for a constant fit over the
entire region. The squared relative importance of variable $X_\ell$ is the sum of
such squared improvements over all internal nodes for which it was chosen
as the splitting variable.

This importance measure is easily generalized to additive tree expansions 
(10.28); it is simply averaged over the trees 

$$\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m). \qquad (10.43)$$

Due to the stabilizing effect of averaging, this measure turns out to be more 
reliable than is its counterpart (10.42) for a single tree. Also, because of 
shrinkage (Section 10.12.1) the masking of important variables by others 
with which they are highly correlated is much less of a problem. Note 
that (10.42) and (10.43) refer to squared relevance; the actual relevances 
are their respective square roots. Since these measures are relative, it is 
customary to assign the largest a value of 100 and then scale the others 
accordingly. Figure 10.6 shows the relative importance of the 57 inputs in
predicting spam versus email. 
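
The bookkeeping behind (10.42) and (10.43) can be sketched as follows. The representation of a tree as a list of (variable, squared improvement) pairs, one per internal node, is an assumption made for this illustration, as are the variable names and numbers.

```python
import math
from collections import defaultdict

def squared_importance(tree_splits):
    """(10.42): sum the squared improvements over nodes splitting on each variable.

    tree_splits holds one (variable, squared_improvement) pair per internal node.
    """
    totals = defaultdict(float)
    for variable, improvement_sq in tree_splits:
        totals[variable] += improvement_sq
    return dict(totals)

def ensemble_importance(trees):
    """(10.43): average over the M trees, take square roots, scale the largest to 100."""
    m = len(trees)
    averaged = defaultdict(float)
    for tree in trees:
        for variable, value in squared_importance(tree).items():
            averaged[variable] += value / m
    relevance = {v: math.sqrt(s) for v, s in averaged.items()}
    largest = max(relevance.values())
    return {v: 100.0 * r / largest for v, r in relevance.items()}

# Two hypothetical three-terminal-node trees (two internal nodes each).
trees = [
    [("income", 12.0), ("latitude", 4.0)],
    [("income", 4.0), ("age", 1.0)],
]
importance = ensemble_importance(trees)
print(importance)
```

A variable that is never chosen for a split simply does not appear in the result, which corresponds to a relative importance of zero.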

For K-class classification, K separate models $f_k(x)$, k = 1, 2, ..., K are
induced, each consisting of a sum of trees

$$f_k(x) = \sum_{m=1}^{M} T_{km}(x). \qquad (10.44)$$

In this case (10.43) generalizes to

$$\mathcal{I}_{\ell k}^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_{km}). \qquad (10.45)$$

Here $\mathcal{I}_{\ell k}$ is the relevance of $X_\ell$ in separating the class k observations from
the other classes. The overall relevance of $X_\ell$ is obtained by averaging over
all of the classes

$$\mathcal{I}_\ell^2 = \frac{1}{K} \sum_{k=1}^{K} \mathcal{I}_{\ell k}^2. \qquad (10.46)$$

Figures 10.23 and 10.24 illustrate the use of these averaged and separate
relative importances.

10.13.2 Partial Dependence Plots 

After the most relevant variables have been identified, the next step is to 
attempt to understand the nature of the dependence of the approximation 
f(X) on their joint values. Graphical renderings of f(X) as a function
of its arguments provide a comprehensive summary of its dependence on
the joint values of the input variables. 

Unfortunately, such visualization is limited to low-dimensional views. 
We can easily display functions of one or two arguments, either continuous 
or discrete (or mixed), in a variety of different ways; this book is filled 
with such displays. Functions of slightly higher dimensions can be plotted 
by conditioning on particular sets of values of all but one or two of the 
arguments, producing a trellis of plots (Becker et al., 1996).¹

For more than two or three variables, viewing functions of the corresponding
higher-dimensional arguments is more difficult. A useful alternative
can sometimes be to view a collection of plots, each one of which shows
the partial dependence of the approximation f(X) on a selected small subset
of the input variables. Although such a collection can seldom provide a
comprehensive depiction of the approximation, it can often produce helpful
clues, especially when f(x) is dominated by low-order interactions (10.40).

Consider the subvector $X_S$ of $\ell < p$ of the input predictor variables
$X^T = (X_1, X_2, \ldots, X_p)$, indexed by $S \subset \{1, 2, \ldots, p\}$. Let C be the complement
set, with $S \cup C = \{1, 2, \ldots, p\}$. A general function f(X) will in principle
depend on all of the input variables: $f(X) = f(X_S, X_C)$. One way to define
the average or partial dependence of f(X) on $X_S$ is

$$f_S(X_S) = E_{X_C} f(X_S, X_C). \qquad (10.47)$$

This is a marginal average of f, and can serve as a useful description of the
effect of the chosen subset on f(X) when, for example, the variables in $X_S$
do not have strong interactions with those in $X_C$.

Partial dependence functions can be used to interpret the results of any 
“black box” learning method. They can be estimated by 

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC}), \qquad (10.48)$$

where $\{x_{1C}, x_{2C}, \ldots, x_{NC}\}$ are the values of $X_C$ occurring in the training
data. This requires a pass over the data for each set of joint values of $X_S$ for
which $\bar{f}_S(X_S)$ is to be evaluated. This can be computationally intensive,
even for moderately sized data sets. Fortunately with decision trees, $\bar{f}_S(X_S)$
(10.48) can be rapidly computed from the tree itself without reference to
the data (Exercise 10.11).
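
The brute-force estimate (10.48) works for any fitted function that can be evaluated at arbitrary inputs. The sketch below assumes a made-up additive function and a tiny made-up training set; the function name `partial_dependence` is hypothetical.

```python
def partial_dependence(f, data, subset, values):
    """Estimate (10.48): average f over the training rows, with the
    coordinates listed in `subset` overwritten by `values`."""
    total = 0.0
    for row in data:
        x = list(row)
        for index, value in zip(subset, values):
            x[index] = value
        total += f(x)
    return total / len(data)

def f_additive(x):
    """Stand-in for a fitted model: h1(x0) + h2(x1), chosen for illustration."""
    return x[0] ** 2 + 3.0 * x[1]

# Made-up training inputs (x0, x1).
train = [(0.0, 1.0), (1.0, -2.0), (2.0, 0.5), (-1.0, 2.5)]

# Partial dependence on x0, evaluated on a small grid of values.
grid = [-1.0, 0.0, 2.0]
pd_curve = [partial_dependence(f_additive, train, (0,), (v,)) for v in grid]
print(pd_curve)
```

Since the stand-in function is additive, the curve equals the square of the grid value plus a constant (the average of the second term over the training rows), in line with the remark about additive effects that follows (10.50). Each grid point costs one pass over the data, which is the expense noted above.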

It is important to note that partial dependence functions defined in 
(10.47) represent the effect of $X_S$ on f(X) after accounting for the (average)
effects of the other variables $X_C$ on f(X). They are not the effect
of $X_S$ on f(X) ignoring the effects of $X_C$. The latter is given by the conditional
expectation

$$\tilde{f}_S(X_S) = E(f(X_S, X_C) \mid X_S), \qquad (10.49)$$

and is the best least squares approximation to f(X) by a function of $X_S$
alone. The quantities $\tilde{f}_S(X_S)$ and $\bar{f}_S(X_S)$ will be the same only in the
unlikely event that $X_S$ and $X_C$ are independent. For example, if the effect
of the chosen variable subset happens to be purely additive,

$$f(X) = h_1(X_S) + h_2(X_C), \qquad (10.50)$$

then (10.47) produces the $h_1(X_S)$ up to an additive constant. If the effect
is purely multiplicative,

$$f(X) = h_1(X_S) \cdot h_2(X_C), \qquad (10.51)$$

then (10.47) produces $h_1(X_S)$ up to a multiplicative constant factor. On
the other hand, (10.49) will not produce $h_1(X_S)$ in either case. In fact,
(10.49) can produce strong effects on variable subsets for which f(X) has 
no dependence at all. 
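
A small made-up example shows how the two quantities can disagree. Here the stand-in function depends only on the second input, but the two inputs are correlated in the training data. The data, function and helper names are all assumptions for illustration, and the conditional expectation is estimated by a simple group mean, which is only possible because the first input is discrete.

```python
def partial_dependence(f, data, index, value):
    """Average f over the training rows with coordinate `index` set to `value`."""
    return sum(f(row[:index] + (value,) + row[index + 1:]) for row in data) / len(data)

def conditional_mean(f, data, index, value):
    """Average f over only those rows whose coordinate `index` equals `value`."""
    rows = [row for row in data if row[index] == value]
    return sum(f(row) for row in rows) / len(rows)

def f_only_x1(x):
    """Stand-in fitted function with no dependence on x0 at all."""
    return 10.0 * x[1]

# Made-up data in which x0 and x1 tend to agree (they are correlated).
train = [(0, 0.0), (0, 0.0), (0, 0.0), (0, 1.0),
         (1, 1.0), (1, 1.0), (1, 1.0), (1, 0.0)]

pd_values = [partial_dependence(f_only_x1, train, 0, v) for v in (0, 1)]
cond_values = [conditional_mean(f_only_x1, train, 0, v) for v in (0, 1)]
print(pd_values, cond_values)
```

The partial dependence on the first input is flat, as it should be, while the conditional mean changes with it purely because of the correlation between the two inputs.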

Viewing plots of the partial dependence of the boosted-tree approximation
(10.28) on selected variable subsets can help to provide a qualitative
description of its properties. Illustrations are shown in Sections 10.8 and
10.14. Owing to the limitations of computer graphics, and human perception,
the size of the subsets $X_S$ must be small ($\ell \approx 1, 2, 3$). There are of
course a large number of such subsets, but only those chosen from among 
the usually much smaller set of highly relevant predictors are likely to be 
informative. Also, those subsets whose effect on f(X) is approximately 
additive (10.50) or multiplicative (10.51) will be most revealing. 

For K-class classification, there are K separate models (10.44), one for
each class. Each one is related to the respective probabilities (10.21) through

$$f_k(X) = \log p_k(X) - \frac{1}{K} \sum_{l=1}^{K} \log p_l(X). \qquad (10.52)$$

Thus each $f_k(X)$ is a monotone increasing function of its respective probability
on a logarithmic scale. Partial dependence plots of each respective
$f_k(X)$ (10.44) on its most relevant predictors (10.45) can help reveal how
the log-odds of realizing that class depend on the respective input variables. 
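
The centering in (10.52) can be checked with a few lines of arithmetic. The sketch below uses made-up class probabilities and assumes the usual exponential (softmax) relation for mapping the functions back to probabilities; the function names are hypothetical.

```python
import math

def symmetric_logits(probs):
    """(10.52): log-probabilities centered by their average over the K classes."""
    logs = [math.log(p) for p in probs]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def probabilities(logits):
    """Map class functions back to probabilities (softmax form, assumed here)."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

p = [0.7, 0.2, 0.1]          # made-up class probabilities at some x
f = symmetric_logits(p)
p_back = probabilities(f)
print(f, p_back)
```

The centered values sum to zero, preserve the ordering of the probabilities, and map back to the original probabilities.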

10.14 Illustrations

In this section we illustrate gradient boosting on a number of larger datasets, 
using different loss functions as appropriate. 


10.14.1 California Housing

This data set (Pace and Barry, 1997) is available from the Carnegie-Mellon
StatLib repository.² It consists of aggregated data from each of 20,460
neighborhoods (1990 census block groups) in California. The response variable
Y is the median house value in each neighborhood measured in units of
$100,000. The predictor variables are demographics such as median income
MedInc, housing density as reflected by the number of houses House, and the
average occupancy in each house AveOccup. Also included as predictors are
the location of each neighborhood (longitude and latitude), and several
quantities reflecting the properties of the houses in the neighborhood: average
number of rooms AveRooms and bedrooms AveBedrms. There are thus
a total of eight predictors, all numeric. 

We fit a gradient boosting model using the MART procedure, with J = 6 
terminal nodes, a learning rate (10.41) of $\nu = 0.1$, and the Huber loss
criterion for predicting the numeric response. We randomly divided the 
dataset into a training set (80%) and a test set (20%). 

Figure 10.13 shows the average absolute error 

$$\mathrm{AAE} = E\,|y - \hat{f}_M(x)| \qquad (10.53)$$

as a function of the number of iterations M on both the training data and test
data. The test error is seen to decrease monotonically with increasing M, 
more rapidly during the early stages and then leveling off to being nearly 
constant as iterations increase. Thus, the choice of a particular value of M 
is not critical, as long as it is not too small. This tends to be the case in 
many applications. The shrinkage strategy (10.41) tends to eliminate the 
problem of overfitting, especially for larger data sets. 

The value of AAE after 800 iterations is 0.31. This can be compared to
that of the optimal constant predictor median$\{y_i\}$ which is 0.89. In terms of
more familiar quantities, the squared multiple correlation coefficient of this
model is $R^2 = 0.84$. Pace and Barry (1997) use a sophisticated spatial autoregression
procedure, where prediction for each neighborhood is based on
median house values in nearby neighborhoods, using the other predictors as
covariates. Experimenting with transformations they achieved $R^2 = 0.85$,
predicting log Y. Using log Y as the response the corresponding value for
gradient boosting was $R^2 = 0.86$.
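
The comparison with the constant predictor can be reproduced on any set of predictions. The sketch below uses made-up responses and predictions, not the housing data; it computes the sample version of (10.53) and the same quantity for the median, which is the best constant under absolute error.

```python
from statistics import median

def average_absolute_error(y, predictions):
    """Sample version of (10.53): mean of |y - prediction|."""
    return sum(abs(a - b) for a, b in zip(y, predictions)) / len(y)

# Made-up responses and model predictions, in units of $100,000.
y = [1.0, 2.0, 2.5, 4.0, 5.0]
model_predictions = [1.5, 2.0, 2.0, 3.5, 4.5]

# Baseline: predict the median response for every observation.
baseline_predictions = [median(y)] * len(y)

aae_model = average_absolute_error(y, model_predictions)
aae_baseline = average_absolute_error(y, baseline_predictions)
print(aae_model, aae_baseline)
```

The ratio of the two values gives a scale-free sense of how much the model improves on knowing nothing about the predictors.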


2 http://lib.stat.cmu.edu.

FIGURE 10.13. Average-absolute error as a function of number of iterations
for the California housing data.


Figure 10.14 displays the relative variable importances for each of the 
eight predictor variables. Not surprisingly, median income in the neighborhood
is the most relevant predictor. Longitude, latitude, and average
occupancy all have roughly half the relevance of income, whereas the others 
are somewhat less influential. 

Figure 10.15 shows single-variable partial dependence plots on the most 
relevant nonlocation predictors. Note that the plots are not strictly smooth. 
This is a consequence of using tree-based models. Decision trees produce 
discontinuous piecewise constant models (10.25). This carries over to sums 
of trees (10.28), with of course many more pieces. Unlike most of the methods
discussed in this book, there is no smoothness constraint imposed on
the result. Arbitrarily sharp discontinuities can be modeled. The fact that 
these curves generally exhibit a smooth trend is because that is what is 
estimated to best predict the response for this problem. This is often the 
case. 

The hash marks at the base of each plot delineate the deciles of the 
data distribution of the corresponding variables. Note that here the data 
density is lower near the edges, especially for larger values. This causes the 
curves to be somewhat less well determined in those regions. The vertical 
scales of the plots are the same, and give a visual comparison of the relative 
importance of the different variables. 

The partial dependence of median house value on median income is 
monotonic increasing, being nearly linear over the main body of data. House 
value is generally monotonic decreasing with increasing average occupancy, 
except perhaps for average occupancy rates less than one. Median house
value has a nonmonotonic partial dependence on average number of rooms.
It has a minimum at approximately three rooms and is increasing both for
smaller and larger values.

FIGURE 10.14. Relative importance of the predictors for the California housing
data.

Median house value is seen to have a very weak partial dependence on 
house age that is inconsistent with its importance ranking (Figure 10.14). 
This suggests that this weak main effect may be masking stronger interaction
effects with other variables. Figure 10.16 shows the two-variable partial
dependence of housing value on joint values of median age and average occupancy.
An interaction between these two variables is apparent. For values
of average occupancy greater than two, house value is nearly independent 
of median age, whereas for values less than two there is a strong dependence 
on age. 

Figure 10.17 shows the two-variable partial dependence of the fitted 
model on joint values of longitude and latitude, displayed as a shaded 
contour plot. There is clearly a very strong dependence of median house 
value on the neighborhood location in California. Note that Figure 10.17 is 
not a plot of house value versus location ignoring the effects of the other 
predictors (10.49). Like all partial dependence plots, it represents the effect 
of location after accounting for the effects of the other neighborhood and 
house attributes (10.47). It can be viewed as representing an extra premium 
one pays for location. This premium is seen to be relatively large near the 
Pacific coast especially in the Bay Area and Los Angeles-San Diego regions.
In the northern, central valley, and southeastern desert regions of
California, location costs considerably less.

FIGURE 10.15. Partial dependence of housing value on the nonlocation variables
for the California housing data. The red ticks at the base of the plot are
deciles of the input variables.

FIGURE 10.16. Partial dependence of house value on median age and average
occupancy. There appears to be a strong interaction effect between these two
variables.

FIGURE 10.17. Partial dependence of median house value on location in California.
One unit is $100,000, at 1990 prices, and the values plotted are relative
to the overall median of $180,000.


10.14.2 New Zealand Fish 

Plant and animal ecologists use regression models to predict species presence,
abundance and richness as a function of environmental variables.
Although for many years simple linear and parametric models were popular,
recent literature shows increasing interest in more sophisticated models
such as generalized additive models (Section 9.1, GAM), multivariate
adaptive regression splines (Section 9.4, MARS) and boosted regression
trees (Leathwick et al., 2005; Leathwick et al., 2006). Here we model the
presence and abundance of the Black Oreo Dory, a marine fish found in the
oceanic waters around New Zealand.³

Figure 10.18 shows the locations of 17,000 trawls (deep-water net fishing, 
with a maximum depth of 2km), and the red points indicate those 2353 
trawls for which the Black Oreo was present, one of over a hundred species 
regularly recorded. The catch size in kg for each species was recorded for 
each trawl. Along with the species catch, a number of environmental measurements
are available for each trawl. These include the average depth of
the trawl (AvgDepth), and the temperature and salinity of the water. Since
the latter two are strongly correlated with depth, Leathwick et al. (2006)
derived instead TempResid and SalResid, the residuals obtained when these
two measures are adjusted for depth (via separate non-parametric regressions).
SSTGrad is a measure of the gradient of the sea surface temperature,
and Chla is a broad indicator of ecosystem productivity via satellite-image
measurements. SusPartMatter provides a measure of suspended particulate 
matter, particularly in coastal waters, and is also satellite derived. 

The goal of this analysis is to estimate the probability of finding Black 
Oreo in a trawl, as well as the expected catch size, standardized to take 
into account the effects of variation in trawl speed and distance, as well 
as the mesh size of the trawl net. The authors used logistic regression 
for estimating the probability. For the catch size, it might seem natural 
to assume a Poisson distribution and model the log of the mean count, 
but this is often not appropriate because of the excessive number of zeros. 
Although specialized approaches have been developed, such as the zero-inflated
Poisson (Lambert, 1992), they chose a simpler approach. If Y is
the (non-negative) catch size,

$$E(Y \mid X) = E(Y \mid Y > 0, X) \cdot \Pr(Y > 0 \mid X). \qquad (10.54)$$


The second term is estimated by the logistic regression, and the first term 
can be estimated using only the 2353 trawls with a positive catch. 
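
The decomposition (10.54) is an identity for non-negative responses, and it can be verified on a handful of made-up catch sizes. In the sketch below the two factors are plain sample averages rather than fitted models, so there is no dependence on predictors; the numbers and the function name are assumptions for illustration.

```python
def expected_catch(prob_positive, mean_if_positive):
    """(10.54): E(Y) = E(Y | Y > 0) * Pr(Y > 0), here without predictors."""
    return mean_if_positive * prob_positive

# Made-up catch sizes in kg; most trawls catch nothing.
catches = [0.0, 0.0, 0.0, 4.0, 0.0, 12.0, 0.0, 8.0]

positive = [c for c in catches if c > 0]
prob_positive = len(positive) / len(catches)        # presence part
mean_if_positive = sum(positive) / len(positive)    # positive-catch part
overall_mean = sum(catches) / len(catches)

combined = expected_catch(prob_positive, mean_if_positive)
print(combined, overall_mean)
```

In the analysis described here each factor is replaced by a model that depends on the predictors: a classifier for presence and a regression fit to the positive catches only.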

For the logistic regression the authors used a gradient boosted model
(GBM)⁴ with binomial deviance loss function, depth-10 trees, and a shrinkage
factor $\nu = 0.025$. For the positive-catch regression, they modeled
log(Y) using a GBM with squared-error loss (also depth-10 trees, but
$\nu = 0.01$), and un-logged the predictions. In both cases they used 10-fold
cross-validation for selecting the number of terms, as well as the shrinkage 
factor. 


3 The models, data, and maps shown here were kindly provided by Dr John Leathwick 
of the National Institute of Water and Atmospheric Research in New Zealand, and Dr 
Jane Elith, School of Botany, University of Melbourne. The collection of the research 
trawl data took place from 1979-2005, and was funded by the New Zealand Ministry of 
Fisheries. 

4 Version 1.5-7 of package gbm in R, ver. 2.2.0.

FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic
zone, showing the locations of 17,000 trawls (small blue dots) taken between 1979
and 2005. The red points indicate trawls for which the species Black Oreo Dory
were present.

FIGURE 10.19. The left panel shows the mean deviance as a function of the
number of trees for the GBM logistic regression model fit to the presence/absence
data. Shown are 10-fold cross-validation on the training data (and 1 × s.e. bars),
and test deviance on the test data. Also shown for comparison is the test deviance
using a GAM model with 8 df for each term. The right panel shows ROC curves
on the test data for the chosen GBM model (vertical line in left plot) and the
GAM model.

Figure 10.19 (left panel) shows the mean binomial deviance for the sequence
of GBM models, both for 10-fold CV and test data. There is a modest
improvement over the performance of a GAM model, fit using smoothing
splines with 8 degrees-of-freedom (df) per term. The right panel shows the
ROC curves (see Section 9.2.5) for both models, which measure predictive
performance. From this point of view, the performance looks very similar,
with GBM perhaps having a slight edge as summarized by the AUC
(area under the curve). At the point of equal sensitivity/specificity, GBM 
achieves 91%, and GAM 90%. 
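
Both summaries can be computed directly from predicted scores and true labels. The sketch below uses made-up scores and labels, not the trawl data. It computes the AUC as the fraction of (present, absent) pairs that the scores rank correctly, counting ties as one half, and evaluates sensitivity and specificity at a single threshold. The function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold):
    """Classify as present when score >= threshold; return (sensitivity, specificity)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return true_pos / n_pos, true_neg / n_neg

# Made-up predicted probabilities and presence (1) / absence (0) labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

area = auc(scores, labels)
sens_spec = sensitivity_specificity(scores, labels, 0.5)
print(area, sens_spec)
```

For these made-up values the threshold 0.5 happens to be a point of equal sensitivity and specificity; in general one would scan the thresholds to locate it.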

Figure 10.20 summarizes the contributions of the variables in the logistic 
GBM fit. We see that there is a well-defined depth range over which Black 
Oreo are caught, with much more frequent capture in colder waters. We do 
not give details of the quantitative catch model; the important variables 
were much the same. 

All the predictors used in these models are available on a fine geographical
grid; in fact they were derived from environmental atlases, satellite images
and the like; see Leathwick et al. (2006) for details. This also means
that predictions can be made on this grid, and imported into GIS mapping
systems. Figure 10.21 shows prediction maps for both presence and catch
size, with both standardized to a common set of trawl conditions; since the
predictors vary in a continuous fashion with geographical location, so do
the predictions.

FIGURE 10.20. The top-left panel shows the relative influence computed from
the GBM logistic regression model. The remaining panels show the partial dependence
plots for the leading five variables, all plotted on the same scale for
comparison.


Because of their ability to model interactions and automatically select 
variables, as well as robustness to outliers and missing data, GBM models 
are rapidly gaining popularity in this data-rich and enthusiastic community. 


10.14.3 Demographics Data

In this section we illustrate gradient boosting on a multiclass classification
problem, using MART. The data come from 9243 questionnaires filled
out by shopping mall customers in the San Francisco Bay Area (Impact
Resources, Inc., Columbus, OH). Among the questions are 14 concerning
demographics. For this illustration the goal is to predict occupation using
the other 13 variables as predictors, and hence identify demographic
variables that discriminate between different occupational categories. We
randomly divided the data into a training set (80%) and test set (20%),
and used J = 6 node trees with a learning rate $\nu = 0.1$.

FIGURE 10.21. Geological prediction maps of the presence probability (left
map) and catch size (right map) obtained from the gradient boosted models.

Figure 10.22 shows the K = 9 occupation class values along with their
corresponding error rates. The overall error rate is 42.5%, which can be
compared to the null rate of 69% obtained by predicting the most numerous
class Prof/Man (Professional/Managerial). The four best predicted classes
are seen to be Retired, Student, Prof/Man, and Homemaker.

Figure 10.23 shows the relative predictor variable importances as averaged
over all classes (10.46). Figure 10.24 displays the individual relative
importance distributions (10.45) for each of the four best predicted classes. 
One sees that the most relevant predictors are generally different for each 
respective class. An exception is age which is among the three most relevant 
for predicting Retired, Student, and Prof/Man. 

Figure 10.25 shows the partial dependence of the log-odds (10.52) on age 
for these three classes. The abscissa values are ordered codes for respective 
equally spaced age intervals. One sees that after accounting for the contributions
of the other variables, the odds of being retired are higher for older
people, whereas the opposite is the case for being a student. The odds of 
being professional/managerial are highest for middle-aged people. These 
results are of course not surprising. They illustrate that inspecting partial 
dependences separately for each class can lead to sensible results. 


FIGURE 10.22. Error rate for each occupation in the demographics data.

FIGURE 10.23. Relative importance of the predictors as averaged over all
classes for the demographics data.

FIGURE 10.24. Predictor variable importances separately for each of the four
classes with lowest error rate for the demographics data.

FIGURE 10.25. Partial dependence of the odds of three different occupations
on age, for the demographics data.


Bibliographic Notes

Schapire (1990) developed the first simple boosting procedure in the PAC
learning framework (Valiant, 1984; Kearns and Vazirani, 1994). Schapire
showed that a weak learner could always improve its performance by training
two additional classifiers on filtered versions of the input data stream.
A weak learner is an algorithm for producing a two-class classifier with
performance guaranteed (with high probability) to be significantly better
than a coin-flip. After learning an initial classifier $G_1$ on the first N training
points,

• $G_2$ is learned on a new sample of N points, half of which are misclassified
by $G_1$;

• $G_3$ is learned on N points for which $G_1$ and $G_2$ disagree;

• the boosted classifier is $G_B$ = majority vote($G_1$, $G_2$, $G_3$).

Schapire’s “Strength of Weak Learnability” theorem proves that $G_B$ has
improved performance over $G_1$.

Freund (1995) proposed a “boost by majority” variation which combined
many weak learners simultaneously and improved the performance of the
simple boosting algorithm of Schapire. The theory supporting both of these
algorithms requires the weak learner to produce a classifier with a fixed
error rate. This led to the more adaptive and realistic AdaBoost (Freund
and Schapire, 1996a) and its offspring, where this assumption was dropped.

Freund and Schapire (1996a) and Schapire and Singer (1999) provide
some theory to support their algorithms, in the form of upper bounds on
generalization error. This theory has evolved in the computational learning
community, initially based on the concepts of PAC learning. Other theories
attempting to explain boosting come from game theory (Freund and
Schapire, 1996b; Breiman, 1999; Breiman, 1998), and VC theory (Schapire
et al., 1998). The bounds and the theory associated with the AdaBoost
algorithms are interesting, but tend to be too loose to be of practical
importance. In practice, boosting achieves results far more impressive than
the bounds would imply. Schapire (2002) and Meir and Rätsch (2003) give
useful overviews more recent than the first edition of this book.

Friedman et al. (2000) and Friedman (2001) form the basis for our exposition
in this chapter. Friedman et al. (2000) analyze AdaBoost statistically,
derive the exponential criterion, and show that it estimates the log-odds
of the class probability. They propose additive tree models, the right-sized
trees and ANOVA representation of Section 10.11, and the multiclass logit
formulation. Friedman (2001) developed gradient boosting and shrinkage
for classification and regression, while Friedman (1999) explored stochastic
variants of boosting. Mason et al. (2000) also embraced a gradient approach
to boosting. As the published discussions of Friedman et al. (2000) show,
there is some controversy about how and why boosting works.

Since the publication of the first edition of this book, these debates have
continued, and spread into the statistical community with a series of papers
on consistency of boosting (Jiang, 2004; Lugosi and Vayatis, 2004; Zhang
and Yu, 2005; Bartlett and Traskin, 2007). Mease and Wyner (2008),
through a series of simulation examples, challenge some of our interpretations
of boosting; our response (Friedman et al., 2008a) puts most of
these objections to rest. A recent survey by Bühlmann and Hothorn (2007)
supports our approach to boosting.


Exercises 


Ex. 10.1 Derive expression (10.12) for the update parameter in AdaBoost. 

Ex. 10.2 Prove result (10.16), that is, the minimizer of the population 
version of the AdaBoost criterion, is one-half of the log odds. 

Ex. 10.3 Show that the marginal average (10.47) recovers additive and
multiplicative functions (10.50) and (10.51), while the conditional
expectation (10.49) does not.

Ex. 10.4

(a) Write a program implementing AdaBoost with trees.

(b) Redo the computations for the example of Figure 10.2. Plot the training
error as well as test error, and discuss its behavior.

(c) Investigate the number of iterations needed to make the test error
finally start to rise.

(d) Change the setup of this example as follows: define two classes, with
the features in Class 1 being $X_1, X_2, \ldots, X_{10}$, standard independent
Gaussian variates. In Class 2, the features $X_1, X_2, \ldots, X_{10}$ are
also standard independent Gaussian, but conditioned on the event
$\sum_j X_j^2 > 12$. Now the classes have significant overlap in feature space.
Repeat the AdaBoost experiments as in Figure 10.2 and discuss the
results.

Ex. 10.5 Multiclass exponential loss (Zhu et al., 2005). For a $K$-class
classification problem, consider the coding $Y = (Y_1, \ldots, Y_K)^T$ with

$$Y_k = \begin{cases} 1, & \text{if } G = \mathcal{G}_k, \\ -\dfrac{1}{K-1}, & \text{otherwise.} \end{cases} \tag{10.55}$$

Let $f = (f_1, \ldots, f_K)^T$ with $\sum_{k=1}^{K} f_k = 0$, and define

$$L(Y, f) = \exp\left(-\frac{1}{K} Y^T f\right). \tag{10.56}$$

(a) Using Lagrange multipliers, derive the population minimizer $f^*$ of
$E(Y, f)$, subject to the zero-sum constraint, and relate these to the
class probabilities.

(b) Show that a multiclass boosting using this loss function leads to a
reweighting algorithm similar to AdaBoost, as in Section 10.4.

Ex. 10.6 McNemar test (Agresti, 1996). We report the test error rates on 
the spam data to be 5.5% for a generalized additive model (GAM), and 
4.5% for gradient boosting (GBM), with a test sample of size 1536. 

(a) Show that the standard error of these estimates is about 0.6%. 

Since the same test data are used for both methods, the error rates are
correlated, and we cannot perform a two-sample t-test. We can compare
the methods directly on each test observation, leading to the summary

                     GBM
    GAM         Correct   Error
    Correct        1434      18
    Error            33      51

The McNemar test focuses on the discordant errors, 33 vs. 18.

(b) Conduct a test to show that GAM makes significantly more errors 
than gradient boosting, with a two-sided p-value of 0.036. 
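
As a rough numerical check of the figures quoted in this exercise, the following sketch computes the binomial standard error of part (a) and a McNemar p-value for part (b). It assumes the normal approximation without continuity correction, which is one way to arrive at a value near 0.036; the function names are illustrative and not part of the text.

```python
import math

def binomial_se(error_rate, n):
    """Standard error of an error rate estimated from n test cases."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n)

def mcnemar_p_value(n01, n10):
    """Two-sided McNemar p-value from the two discordant counts,
    using the normal approximation with no continuity correction."""
    z = abs(n01 - n10) / math.sqrt(n01 + n10)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

se_gam = binomial_se(0.055, 1536)   # GAM test error 5.5%
se_gbm = binomial_se(0.045, 1536)   # GBM test error 4.5%
p_value = mcnemar_p_value(33, 18)   # discordant errors from the table
print(round(se_gam, 4), round(se_gbm, 4), round(p_value, 3))
```

Only the 33 + 18 = 51 observations on which the two methods disagree enter the test; the 1434 + 51 concordant cases carry no information about which method is better.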

Ex. 10.7 Derive expression (10.32). 

Ex. 10.8 Consider a $K$-class problem where the targets $y_{ik}$ are coded as
1 if observation $i$ is in class $k$ and zero otherwise. Suppose we have a
current model $f_k(x)$, $k = 1, \ldots, K$, with $\sum_{k=1}^{K} f_k(x) = 0$ (see (10.21) in
Section 10.6). We wish to update the model for observations in a region $R$
in predictor space, by adding constants $f_k(x) + \gamma_k$, with $\gamma_K = 0$.

(a) Write down the multinomial log-likelihood for this problem, and its
first and second derivatives.

(b) Using only the diagonal of the Hessian matrix in (a), and starting
from $\gamma_k = 0 \;\forall k$, show that a one-step approximate Newton update
for $\gamma_k$ is

$$\gamma_k^1 = \frac{\sum_{x_i \in R} (y_{ik} - p_{ik})}{\sum_{x_i \in R} p_{ik}(1 - p_{ik})}, \quad k = 1, \ldots, K-1, \tag{10.57}$$

where $p_{ik} = \exp(f_k(x_i)) / \sum_{\ell=1}^{K} \exp(f_\ell(x_i))$.

(c) We prefer our update to sum to zero, as the current model does. Using
symmetry arguments, show that

$$\hat{\gamma}_k = \frac{K-1}{K}\left(\gamma_k^1 - \frac{1}{K}\sum_{\ell=1}^{K} \gamma_\ell^1\right), \quad k = 1, \ldots, K \tag{10.58}$$

is an appropriate update, where $\gamma_k^1$ is defined as in (10.57) for all
$k = 1, \ldots, K$.

Ex. 10.9 Consider a $K$-class problem where the targets $y_{ik}$ are coded as
1 if observation $i$ is in class $k$ and zero otherwise. Using the multinomial
deviance loss function (10.22) and the symmetric logistic transform, use
the arguments leading to the gradient boosting Algorithm 10.3 to derive
Algorithm 10.4. Hint: See Exercise 10.8 for step 2(b)iii.


Ex. 10.10 Show that for $K = 2$ class classification, only one tree needs to
be grown at each gradient-boosting iteration.


Ex. 10.11 Show how to compute the partial dependence function $f_S(X_S)$
in (10.47) efficiently.

Ex. 10.12 Referring to (10.49), let $S = \{1\}$ and $C = \{2\}$, with
$f(X_1, X_2) = X_1$. Assume $X_1$ and $X_2$ are bivariate Gaussian, each with mean
zero, variance one, and $E(X_1 X_2) = \rho$. Show that $E(f(X_1, X_2) \mid X_2) = \rho X_2$,
even though $f$ is not a function of $X_2$.

Algorithm 10.4 Gradient Boosting for K-class Classification.

1. Initialize $f_{k0}(x) = 0$, $k = 1, 2, \ldots, K$.

2. For $m = 1$ to $M$:

(a) Set

$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{\ell=1}^{K} e^{f_\ell(x)}}, \quad k = 1, 2, \ldots, K.$$

(b) For $k = 1$ to $K$:

i. Compute $r_{ikm} = y_{ik} - p_k(x_i)$, $i = 1, 2, \ldots, N$.

ii. Fit a regression tree to the targets $r_{ikm}$, $i = 1, 2, \ldots, N$,
giving terminal regions $R_{jkm}$, $j = 1, 2, \ldots, J_m$.

iii. Compute

$$\gamma_{jkm} = \frac{K-1}{K} \, \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}|(1 - |r_{ikm}|)}, \quad j = 1, 2, \ldots, J_m.$$

iv. Update $f_{km}(x) = f_{k,m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jkm} I(x \in R_{jkm})$.

3. Output $\hat{f}_k(x) = f_{kM}(x)$, $k = 1, 2, \ldots, K$.
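
Steps 2(a), 2(b)i and 2(b)iii of Algorithm 10.4 can be sketched for a single terminal region as follows. This is only an illustration under stated assumptions: the tree fit of step 2(b)ii is skipped (all observations are taken to lie in one region), and the toy labels and function names are hypothetical.

```python
import math

def class_probabilities(f_values):
    """Step 2(a): softmax of the current scores f_1(x), ..., f_K(x)."""
    shift = max(f_values)                      # subtract max for numerical stability
    exps = [math.exp(f - shift) for f in f_values]
    total = sum(exps)
    return [e / total for e in exps]

def leaf_update(residuals, n_classes):
    """Step 2(b)iii: gamma for one terminal region, given the residuals
    r_ikm of the observations that fall in that region."""
    numerator = sum(residuals)
    denominator = sum(abs(r) * (1.0 - abs(r)) for r in residuals)
    return (n_classes - 1) / n_classes * numerator / denominator

# Toy data (hypothetical): three observations in one region, K = 3 classes,
# all scores still at their initial value f_k0(x) = 0.
K = 3
labels = [0, 0, 1]                             # observed class of each observation
probs = [class_probabilities([0.0] * K) for _ in labels]

k = 0                                          # update for class k = 0
# Step 2(b)i: residual = indicator of class k minus current probability.
residuals = [(1.0 if y == k else 0.0) - p[k] for y, p in zip(labels, probs)]
gamma = leaf_update(residuals, K)
print(residuals, gamma)
```

A full implementation would repeat this for every terminal region of every class tree and add the resulting constants to the scores, as in step 2(b)iv.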
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Neural Networks 


11.1 Introduction 


In this chapter we describe a class of learning methods that was developed
separately in different fields (statistics and artificial intelligence) based
on essentially identical models. The central idea is to extract linear
combinations of the inputs as derived features, and then model the target as
a nonlinear function of these features. The result is a powerful learning
method, with widespread applications in many fields. We first discuss the
projection pursuit model, which evolved in the domain of semiparametric
statistics and smoothing. The rest of the chapter is devoted to neural
network models.

11.2 Projection Pursuit Regression 

As in our generic supervised learning problem, assume we have an input
vector $X$ with $p$ components, and a target $Y$. Let $\omega_m$, $m = 1, 2, \ldots, M$, be
unit $p$-vectors of unknown parameters. The projection pursuit regression
(PPR) model has the form

$$f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X). \tag{11.1}$$

This is an additive model, but in the derived features $V_m = \omega_m^T X$ rather
than the inputs themselves. The functions $g_m$ are unspecified and are
estimated along with the directions $\omega_m$ using some flexible smoothing method
(see below).

FIGURE 11.1. Perspective plots of two ridge functions.
(Left:) $g(V) = 1/[1 + \exp(-5(V - 0.5))]$, where $V = (X_1 + X_2)/\sqrt{2}$.
(Right:) $g(V) = (V + 0.1)\sin(1/(V/3 + 0.1))$, where $V = X_1$.

The function $g_m(\omega_m^T X)$ is called a ridge function in $\mathbb{R}^p$. It varies only
in the direction defined by the vector $\omega_m$. The scalar variable $V_m = \omega_m^T X$
is the projection of $X$ onto the unit vector $\omega_m$, and we seek $\omega_m$ so that
the model fits well, hence the name “projection pursuit.” Figure 11.1 shows
some examples of ridge functions. In the example on the left
$\omega = (1/\sqrt{2})(1, 1)^T$, so that the function only varies in the direction
$X_1 + X_2$. In the example on the right, $\omega = (1, 0)$.

The PPR model (11.1) is very general, since the operation of forming
nonlinear functions of linear combinations generates a surprisingly large
class of models. For example, the product $X_1 \cdot X_2$ can be written as
$[(X_1 + X_2)^2 - (X_1 - X_2)^2]/4$, and higher-order products can be represented
similarly.
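
The product identity can be checked numerically with a short sketch. It assumes the ridge function $g(v) = v^2$ and the unnormalized directions $(1, 1)$ and $(1, -1)$; rescaling them to unit length only changes the constant absorbed into $g$. The function names are illustrative.

```python
def ridge_square(x, direction):
    """g(w^T x) with g(v) = v**2, a ridge function along `direction`."""
    v = sum(w * xj for w, xj in zip(direction, x))
    return v * v

def product_via_ridges(x1, x2):
    """X1 * X2 written as a combination of two ridge functions (M = 2)."""
    x = (x1, x2)
    return (ridge_square(x, (1.0, 1.0)) - ridge_square(x, (1.0, -1.0))) / 4.0

points = [(0.5, -2.0), (3.0, 4.0), (-1.5, -0.25)]
for x1, x2 in points:
    # The two-term ridge expansion reproduces the product at every point.
    assert abs(product_via_ridges(x1, x2) - x1 * x2) < 1e-12
```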

In fact, if $M$ is taken arbitrarily large, for appropriate choice of $g_m$ the
PPR model can approximate any continuous function in $\mathbb{R}^p$ arbitrarily
well. Such a class of models is called a universal approximator. However,
this generality comes at a price. Interpretation of the fitted model is usually
difficult, because each input enters into the model in a complex and
multifaceted way. As a result, the PPR model is most useful for prediction, and
not very useful for producing an understandable model for the data. The
$M = 1$ model, known as the single index model in econometrics, is an
exception. It is slightly more general than the linear regression model, and
offers a similar interpretation.

How do we fit a PPR model, given training data $(x_i, y_i)$, $i = 1, 2, \ldots, N$?
We seek the approximate minimizers of the error function

$$\sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} g_m(\omega_m^T x_i) \right]^2 \tag{11.2}$$

over functions $g_m$ and direction vectors $\omega_m$, $m = 1, 2, \ldots, M$. As in other
smoothing problems, we need either explicitly or implicitly to impose
complexity constraints on the $g_m$, to avoid overfit solutions.

Consider just one term ($M = 1$, and drop the subscript). Given the
direction vector $\omega$, we form the derived variables $v_i = \omega^T x_i$. Then we have
a one-dimensional smoothing problem, and we can apply any scatterplot
smoother, such as a smoothing spline, to obtain an estimate of $g$.

On the other hand, given $g$, we want to minimize (11.2) over $\omega$. A
Gauss-Newton search is convenient for this task. This is a quasi-Newton method,
in which the part of the Hessian involving the second derivative of $g$ is
discarded. It can be simply derived as follows. Let $\omega_{\text{old}}$ be the current
estimate for $\omega$. We write

$$g(\omega^T x_i) \approx g(\omega_{\text{old}}^T x_i) + g'(\omega_{\text{old}}^T x_i)(\omega - \omega_{\text{old}})^T x_i \tag{11.3}$$

to give

$$\sum_{i=1}^{N} \left[ y_i - g(\omega^T x_i) \right]^2 \approx \sum_{i=1}^{N} g'(\omega_{\text{old}}^T x_i)^2 \left[ \left( \omega_{\text{old}}^T x_i + \frac{y_i - g(\omega_{\text{old}}^T x_i)}{g'(\omega_{\text{old}}^T x_i)} \right) - \omega^T x_i \right]^2. \tag{11.4}$$

To minimize the right-hand side, we carry out a least squares regression
with target $\omega_{\text{old}}^T x_i + (y_i - g(\omega_{\text{old}}^T x_i))/g'(\omega_{\text{old}}^T x_i)$ on the input $x_i$, with
weights $g'(\omega_{\text{old}}^T x_i)^2$ and no intercept (bias) term. This produces the updated
coefficient vector $\omega_{\text{new}}$.

These two steps, estimation of $g$ and $\omega$, are iterated until convergence.
With more than one term in the PPR model, the model is built in a forward
stage-wise manner, adding a pair $(\omega_m, g_m)$ at each stage.

There are a number of implementation details.

• Although any smoothing method can in principle be used, it is convenient
if the method provides derivatives. Local regression and smoothing
splines are convenient.

• After each step the $g_m$’s from previous steps can be readjusted using
the backfitting procedure described in Chapter 9. While this may
lead ultimately to fewer terms, it is not clear whether it improves
prediction performance.

• Usually the $\omega_m$ are not readjusted (partly to avoid excessive
computation), although in principle they could be as well.

• The number of terms $M$ is usually estimated as part of the forward
stage-wise strategy. The model building stops when the next term
does not appreciably improve the fit of the model. Cross-validation
can also be used to determine $M$.

There are many other applications, such as density estimation (Friedman
et al., 1984; Friedman, 1987), where the projection pursuit idea can be used.
In particular, see the discussion of ICA in Section 14.7 and its relationship
with exploratory projection pursuit. However, the projection pursuit
regression model has not been widely used in the field of statistics, perhaps
because at the time of its introduction (1981), its computational demands
exceeded the capabilities of most readily available computers. But it does
represent an important intellectual advance, one that has blossomed in its
reincarnation in the field of neural networks, the topic of the rest of this
chapter.
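
Returning to the fitting procedure of this section, the Gauss-Newton update for the direction in a single-term PPR model, based on (11.3) and (11.4), can be sketched as follows. This is a minimal illustration under stated assumptions: $p = 2$ inputs, a known ridge function $g(v) = \tanh(v)$ (in practice $g$ would be re-estimated by a smoother between updates), and noise-free toy data. The direction `w_true` and all function names are hypothetical.

```python
import math

def g(v):
    """Assumed known ridge function (in PPR this would be a fitted smoother)."""
    return math.tanh(v)

def g_prime(v):
    """Derivative of tanh."""
    return 1.0 - math.tanh(v) ** 2

def gauss_newton_step(w_old, xs, ys):
    """One update of the direction w for p = 2 inputs, following (11.4):
    weighted least squares without intercept, with
      target  w_old^T x_i + (y_i - g(w_old^T x_i)) / g'(w_old^T x_i)
      weight  g'(w_old^T x_i)^2."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), y in zip(xs, ys):
        v = w_old[0] * x1 + w_old[1] * x2
        target = v + (y - g(v)) / g_prime(v)
        weight = g_prime(v) ** 2
        # Accumulate the normal equations (X^T W X) w = X^T W target.
        a11 += weight * x1 * x1
        a12 += weight * x1 * x2
        a22 += weight * x2 * x2
        b1 += weight * x1 * target
        b2 += weight * x2 * target
    det = a11 * a22 - a12 * a12
    # Solve the 2 x 2 system by Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Noise-free toy data on a 5 x 5 grid, generated from a hypothetical direction.
w_true = (0.6, 0.8)
xs = [(i / 2.0, j / 2.0) for i in range(-2, 3) for j in range(-2, 3)]
ys = [g(w_true[0] * x1 + w_true[1] * x2) for x1, x2 in xs]

w_start = (1.0, 0.0)
w_next = gauss_newton_step(w_start, xs, ys)   # one step away from the start
w_fixed = gauss_newton_step(w_true, xs, ys)   # the true direction is a fixed point
print(w_next, w_fixed)
```

With noise-free data all residuals vanish at `w_true`, so the weighted regression returns `w_true` unchanged; in general the step is repeated, alternating with the smoothing step that updates $g$.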


11.3 Neural Networks 

The term neural network has evolved to encompass a large class of models
and learning methods. Here we describe the most widely used “vanilla”
neural net, sometimes called the single hidden layer back-propagation network,
or single layer perceptron. There has been a great deal of hype surrounding
neural networks, making them seem magical and mysterious. As we make
clear in this section, they are just nonlinear statistical models, much like
the projection pursuit regression model discussed above.

A neural network is a two-stage regression or classification model,
typically represented by a network diagram as in Figure 11.2. This network
applies both to regression and classification. For regression, typically $K = 1$
and there is only one output unit $Y_1$ at the top. However, these networks
can handle multiple quantitative responses in a seamless fashion, so we will
deal with the general case.

For $K$-class classification, there are $K$ units at the top, with the $k$th
unit modeling the probability of class $k$. There are $K$ target measurements
$Y_k$, $k = 1, \ldots, K$, each being coded as a 0-1 variable for the $k$th class.

Derived features $Z_m$ are created from linear combinations of the inputs,
and then the target $Y_k$ is modeled as a function of linear combinations of
the $Z_m$,

$$\begin{aligned}
Z_m &= \sigma(\alpha_{0m} + \alpha_m^T X), & m &= 1, \ldots, M, \\
T_k &= \beta_{0k} + \beta_k^T Z, & k &= 1, \ldots, K, \\
f_k(X) &= g_k(T), & k &= 1, \ldots, K,
\end{aligned} \tag{11.5}$$

where $Z = (Z_1, Z_2, \ldots, Z_M)$, and $T = (T_1, T_2, \ldots, T_K)$.

The activation function $\sigma(v)$ is usually chosen to be the sigmoid
$\sigma(v) = 1/(1 + e^{-v})$; see Figure 11.3 for a plot of $1/(1 + e^{-v})$. Sometimes Gaussian
radial basis functions (Chapter 6) are used for the $\sigma(v)$, producing what is
known as a radial basis function network.

Neural network diagrams like Figure 11.2 are sometimes drawn with an
additional bias unit feeding into every unit in the hidden and output layers.
Thinking of the constant “1” as an additional input feature, this bias unit
captures the intercepts $\alpha_{0m}$ and $\beta_{0k}$ in model (11.5).

FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network.

The output function $g_k(T)$ allows a final transformation of the vector of
outputs $T$. For regression we typically choose the identity function
$g_k(T) = T_k$. Early work in $K$-class classification also used the identity
function, but this was later abandoned in favor of the softmax function

$$g_k(T) = \frac{e^{T_k}}{\sum_{\ell=1}^{K} e^{T_\ell}}. \tag{11.6}$$

This is of course exactly the transformation used in the multilogit model
(Section 4.4), and produces positive estimates that sum to one. In
Section 4.2 we discuss other problems with linear activation functions, in
particular potentially severe masking effects.
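
To make the two-stage structure concrete, the following sketch evaluates model (11.5) for one input vector, with the softmax output (11.6) for classification and the identity output for regression. The weight values and the function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    """Output function (11.6); subtracting max(t) avoids overflow."""
    shift = max(t)
    exps = [math.exp(tk - shift) for tk in t]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, alpha0, alpha, beta0, beta, classification=True):
    """Evaluate model (11.5) for one input vector x.
    alpha[m] holds the p weights of hidden unit m, beta[k] the M weights
    of output unit k; alpha0 and beta0 are the intercepts (biases)."""
    z = [sigmoid(a0 + sum(a * xj for a, xj in zip(am, x)))
         for a0, am in zip(alpha0, alpha)]
    t = [b0 + sum(b * zm for b, zm in zip(bk, z))
         for b0, bk in zip(beta0, beta)]
    return softmax(t) if classification else t   # identity output for regression

# Hypothetical weights: p = 2 inputs, M = 3 hidden units, K = 2 classes.
alpha0 = [0.0, 0.1, -0.1]
alpha = [[0.5, -0.5], [1.0, 0.2], [-0.3, 0.8]]
beta0 = [0.0, 0.0]
beta = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]

probs = forward([1.0, 2.0], alpha0, alpha, beta0, beta)
print(probs)
```

The softmax outputs are positive and sum to one, so they can be read as class probabilities; with `classification=False` the same function returns the raw linear outputs $T_k$.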

The units in the middle of the network, computing the derived features
$Z_m$, are called hidden units because the values $Z_m$ are not directly
observed. In general there can be more than one hidden layer, as illustrated
in the example at the end of this chapter. We can think of the $Z_m$ as a
basis expansion of the original inputs $X$; the neural network is then a
standard linear model, or linear multilogit model, using these transformations
as inputs. There is, however, an important enhancement over the
basis-expansion techniques discussed in Chapter 5; here the parameters of the
basis functions are learned from the data.

FIGURE 11.3. Plot of the sigmoid function $\sigma(v) = 1/(1+\exp(-v))$ (red curve),
commonly used in the hidden layer of a neural network. Included are $\sigma(sv)$ for
$s = \frac{1}{2}$ (blue curve) and $s = 10$ (purple curve). The scale parameter $s$ controls
the activation rate, and we can see that large $s$ amounts to a hard activation at
$v = 0$. Note that $\sigma(s(v - v_0))$ shifts the activation threshold from 0 to $v_0$.

Notice that if $\sigma$ is the identity function, then the entire model collapses
to a linear model in the inputs. Hence a neural network can be thought of
as a nonlinear generalization of the linear model, both for regression and
classification. By introducing the nonlinear transformation $\sigma$, it greatly
enlarges the class of linear models. In Figure 11.3 we see that the rate of
activation of the sigmoid depends on the norm of $\alpha_m$, and if $\|\alpha_m\|$ is very
small, the unit will indeed be operating in the linear part of its activation
function.

Notice also that the neural network model with one hidden layer has
exactly the same form as the projection pursuit model described above.
The difference is that the PPR model uses nonparametric functions $g_m(v)$,
while the neural network uses a far simpler function based on $\sigma(v)$, with
three free parameters in its argument. In detail, viewing the neural network
model as a PPR model, we identify

$$\begin{aligned}
g_m(\omega_m^T X) &= \beta_m \sigma(\alpha_{0m} + \alpha_m^T X) \\
&= \beta_m \sigma(\alpha_{0m} + \|\alpha_m\|(\omega_m^T X)),
\end{aligned} \tag{11.7}$$

where $\omega_m = \alpha_m/\|\alpha_m\|$ is the $m$th unit-vector. Since
$\sigma_{\beta,\alpha_0,s}(v) = \beta\sigma(\alpha_0 + sv)$ has lower complexity than a more general
nonparametric $g(v)$, it is not surprising that a neural network might use 20 or
100 such functions, while the PPR model typically uses fewer terms ($M = 5$ or
10, for example).

Finally, we note that the name “neural networks” derives from the fact
that they were first developed as models for the human brain. Each unit
represents a neuron, and the connections (links in Figure 11.2) represent
synapses. In early models, the neurons fired when the total signal passed to
that unit exceeded a certain threshold. In the model above, this corresponds
to use of a step function for $\sigma(Z)$ and $g_m(T)$. Later the neural network was
recognized as a useful tool for nonlinear statistical modeling, and for this
purpose the step function is not smooth enough for optimization. Hence the
step function was replaced by a smoother threshold function, the sigmoid
in Figure 11.3.


11.4 Fitting Neural Networks 

The neural network model has unknown parameters, often called weights,
and we seek values for them that make the model fit the training data well.
We denote the complete set of weights by $\theta$, which consists of

$$\begin{aligned}
\{\alpha_{0m}, \alpha_m;\ m = 1, 2, \ldots, M\} &\quad M(p+1) \text{ weights}, \\
\{\beta_{0k}, \beta_k;\ k = 1, 2, \ldots, K\} &\quad K(M+1) \text{ weights}.
\end{aligned} \tag{11.8}$$

For regression, we use sum-of-squared errors as our measure of fit (error
function)

$$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2. \tag{11.9}$$

For classification we use either squared error or cross-entropy (deviance):

$$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i), \tag{11.10}$$

and the corresponding classifier is $G(x) = \operatorname{argmax}_k f_k(x)$. With the softmax
activation function and the cross-entropy error function, the neural network
model is exactly a linear logistic regression model in the hidden units, and
all the parameters are estimated by maximum likelihood.

Typically we don’t want the global minimizer of $R(\theta)$, as this is likely
to be an overfit solution. Instead some regularization is needed: this is
achieved directly through a penalty term, or indirectly by early stopping.
Details are given in the next section.

The generic approach to minimizing $R(\theta)$ is by gradient descent, called
back-propagation in this setting. Because of the compositional form of the
model, the gradient can be easily derived using the chain rule for
differentiation. This can be computed by a forward and backward sweep over the
network, keeping track only of quantities local to each unit.

Here is back-propagation in detail for squared error loss. Let
$z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, from (11.5) and let $z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi})$. Then we have

$$R(\theta) \equiv \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2, \tag{11.11}$$

with derivatives

$$\begin{aligned}
\frac{\partial R_i}{\partial \beta_{km}} &= -2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi}, \\
\frac{\partial R_i}{\partial \alpha_{m\ell}} &= -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{i\ell}.
\end{aligned} \tag{11.12}$$

Given these derivatives, a gradient descent update at the $(r + 1)$st
iteration has the form

$$\begin{aligned}
\beta_{km}^{(r+1)} &= \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}}, \\
\alpha_{m\ell}^{(r+1)} &= \alpha_{m\ell}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{m\ell}^{(r)}},
\end{aligned} \tag{11.13}$$

where $\gamma_r$ is the learning rate, discussed below.

Now write (11.12) as

$$\begin{aligned}
\frac{\partial R_i}{\partial \beta_{km}} &= \delta_{ki} z_{mi}, \\
\frac{\partial R_i}{\partial \alpha_{m\ell}} &= s_{mi} x_{i\ell}.
\end{aligned} \tag{11.14}$$

The quantities $\delta_{ki}$ and $s_{mi}$ are “errors” from the current model at the
output and hidden layer units, respectively. From their definitions, these
errors satisfy

$$s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}, \tag{11.15}$$

known as the back-propagation equations. Using this, the updates in (11.13)
can be implemented with a two-pass algorithm. In the forward pass, the
current weights are fixed and the predicted values $\hat{f}_k(x_i)$ are computed
from formula (11.5). In the backward pass, the errors $\delta_{ki}$ are computed,
and then back-propagated via (11.15) to give the errors $s_{mi}$. Both sets of
errors are then used to compute the gradients for the updates in (11.13),
via (11.14).

This two-pass procedure is what is known as back-propagation. It has
also been called the delta rule (Widrow and Hoff, 1960). The computational
components for cross-entropy have the same form as those for the sum of
squares error function, and are derived in Exercise 11.3.

The advantages of back-propagation are its simple, local nature. In the
back-propagation algorithm, each hidden unit passes and receives
information only to and from units that share a connection. Hence it can be
implemented efficiently on a parallel architecture computer.
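
The two-pass computation can be sketched for a single observation as follows. This is an illustration under simplifying assumptions that are not in the text: the output function is the identity (so $g_k' = 1$), the intercepts are handled by including a constant 1 as the first input, and there is no output-layer intercept. The weights and names are hypothetical. One gradient entry is compared with a finite-difference approximation as a sanity check.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha, beta):
    """Forward pass: hidden values z_m and outputs f_k (identity g_k)."""
    z = [sigmoid(sum(a * xl for a, xl in zip(am, x))) for am in alpha]
    f = [sum(b * zm for b, zm in zip(bk, z)) for bk in beta]
    return z, f

def loss(x, y, alpha, beta):
    """Squared error R_i for one observation."""
    _, f = forward(x, alpha, beta)
    return sum((yk - fk) ** 2 for yk, fk in zip(y, f))

def gradients(x, y, alpha, beta):
    """Backward pass for one observation, following (11.14) and (11.15)."""
    z, f = forward(x, alpha, beta)
    # Output-layer errors: delta_k = -2 (y_k - f_k) g_k'(.), with g_k' = 1 here.
    delta = [-2.0 * (yk - fk) for yk, fk in zip(y, f)]
    # Hidden-layer errors (11.15); for the sigmoid, sigma'(v) = z (1 - z).
    s = [zm * (1.0 - zm) * sum(beta[k][m] * delta[k] for k in range(len(beta)))
         for m, zm in enumerate(z)]
    grad_beta = [[dk * zm for zm in z] for dk in delta]     # dR_i / d beta_km
    grad_alpha = [[sm * xl for xl in x] for sm in s]        # dR_i / d alpha_ml
    return grad_alpha, grad_beta

# Hypothetical example: p = 3 inputs (the first is the constant 1),
# M = 2 hidden units, K = 2 outputs.
x, y = [1.0, 0.5, -1.2], [1.0, 0.0]
alpha = [[0.1, -0.4, 0.3], [-0.2, 0.6, 0.5]]
beta = [[0.7, -0.3], [0.2, 0.9]]
grad_alpha, grad_beta = gradients(x, y, alpha, beta)

# Compare one entry with a central finite difference of the loss.
eps = 1e-6
alpha_plus = [row[:] for row in alpha]
alpha_minus = [row[:] for row in alpha]
alpha_plus[0][1] += eps
alpha_minus[0][1] -= eps
numeric = (loss(x, y, alpha_plus, beta) - loss(x, y, alpha_minus, beta)) / (2 * eps)
print(grad_alpha[0][1], numeric)
```

Each hidden-unit error `s[m]` uses only the output errors and the weights on that unit's own connections, which is the locality property described above.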

The updates in (11.13) are a kind of batch learning, with the parameter
updates being a sum over all of the training cases. Learning can also
be carried out online: processing each observation one at a time, updating
the gradient after each training case, and cycling through the training
cases many times. In this case, the sums in equations (11.13) are replaced
by a single summand. A training epoch refers to one sweep through the
entire training set. Online training allows the network to handle very large
training sets, and also to update the weights as new observations come in.

The learning rate $\gamma_r$ for batch learning is usually taken to be a
constant, and can also be optimized by a line search that minimizes the error
function at each update. With online learning $\gamma_r$ should decrease to zero
as the iteration $r \to \infty$. This learning is a form of stochastic
approximation (Robbins and Munro, 1951); results in this field ensure convergence if
$\gamma_r \to 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 < \infty$ (satisfied, for example, by $\gamma_r = 1/r$).
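
A very small example of online learning with the rate $\gamma_r = 1/r$ is sketched below. It is not a neural network: the assumed loss is $R_i(\theta) = (x_i - \theta)^2/2$ for a single parameter, chosen because the online updates then reproduce the running sample mean exactly. The data and the function name are hypothetical.

```python
def online_mean(observations, theta=0.0):
    """Online gradient descent on R_i(theta) = (x_i - theta)^2 / 2
    with learning rate gamma_r = 1 / r."""
    for r, x in enumerate(observations, start=1):
        gamma = 1.0 / r
        gradient = theta - x          # derivative of (x - theta)^2 / 2
        theta -= gamma * gradient     # single-summand version of the update
    return theta

data = [2.0, 4.0, 9.0, 1.0]
estimate = online_mean(data)
print(estimate, sum(data) / len(data))
```

After the $r$th observation the estimate equals the mean of the first $r$ values, so one pass over the data already gives the batch solution in this simple case.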

Back-propagation can be very slow, and for that reason is usually not 
the method of choice. Second-order techniques such as Newton’s method 
are not attractive here, because the second derivative matrix of R (the 
Hessian) can be very large. Better approaches to fitting include conjugate 
gradients and variable metric methods. These avoid explicit computation 
of the second derivative matrix while still providing faster convergence. 


11.5 Some Issues in Training Neural Networks 

There is quite an art in training neural networks. The model is generally 
overparametrized, and the optimization problem is nonconvex and unstable 
unless certain guidelines are followed. In this section we summarize some 
of the important issues. 


11.5.1 Starting Values 

Note that if the weights are near zero, then the operative part of the sigmoid
(Figure 11.3) is roughly linear, and hence the neural network collapses into
an approximately linear model (Exercise 11.2). Usually starting values for
weights are chosen to be random values near zero. Hence the model starts
out nearly linear, and becomes nonlinear as the weights increase. Individual
units localize to directions and introduce nonlinearities where needed. Use
of exact zero weights leads to zero derivatives and perfect symmetry, and
the algorithm never moves. Starting instead with large weights often leads
to poor solutions.


11.5.2 Overfitting 

Often neural networks have too many weights and will overfit the data at
the global minimum of $R$. In early developments of neural networks, either
by design or by accident, an early stopping rule was used to avoid
overfitting. Here we train the model only for a while, and stop well before we
approach the global minimum. Since the weights start at a highly
regularized (linear) solution, this has the effect of shrinking the final model toward
a linear model. A validation dataset is useful for determining when to stop,
since we expect the validation error to start increasing.

A more explicit method for regularization is weight decay, which is
analogous to ridge regression used for linear models (Section 3.4.1). We add a
penalty to the error function $R(\theta) + \lambda J(\theta)$, where

$$J(\theta) = \sum_{km} \beta_{km}^2 + \sum_{m\ell} \alpha_{m\ell}^2 \tag{11.16}$$

and $\lambda \ge 0$ is a tuning parameter. Larger values of $\lambda$ will tend to shrink
the weights toward zero: typically cross-validation is used to estimate $\lambda$.
The effect of the penalty is to simply add terms $2\lambda\beta_{km}$ and $2\lambda\alpha_{m\ell}$ to the
respective gradient expressions (11.13). Other forms for the penalty have
been proposed, for example,

$$J(\theta) = \sum_{km} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{m\ell} \frac{\alpha_{m\ell}^2}{1 + \alpha_{m\ell}^2}, \tag{11.17}$$

known as the weight elimination penalty. This has the effect of shrinking
smaller weights more than (11.16) does.
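
The two penalties and the gradient adjustment for weight decay can be sketched as follows. The sketch assumes all weights have been flattened into one list and writes the tuning parameter out explicitly as the argument `lam`; the weight values and function names are hypothetical.

```python
def weight_decay_penalty(weights):
    """J(theta) in (11.16): sum of squared weights."""
    return sum(w * w for w in weights)

def weight_elimination_penalty(weights):
    """J(theta) in (11.17): sum of w^2 / (1 + w^2)."""
    return sum(w * w / (1.0 + w * w) for w in weights)

def penalized_gradient(gradient, weights, lam):
    """Gradient of R(theta) + lam * J(theta) for the weight decay penalty."""
    return [g + 2.0 * lam * w for g, w in zip(gradient, weights)]

weights = [0.1, -2.0, 3.0]        # hypothetical flattened alpha and beta weights
decay = weight_decay_penalty(weights)              # dominated by large weights
elimination = weight_elimination_penalty(weights)  # each term is below 1
adjusted = penalized_gradient([0.0, 0.0, 0.0], weights, lam=0.5)
print(decay, elimination, adjusted)
```

Because each term of the weight elimination penalty is bounded by one, large weights contribute little additional penalty, while small weights are penalized much as under (11.16).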

Figure 11.4 shows the result of training a neural network with ten hidden 
units, without weight decay (upper panel) and with weight decay (lower 
panel), to the mixture example of Chapter 2. Weight decay has clearly 
improved the prediction. Figure 11.5 shows heat maps of the estimated 
weights from the training (grayscale versions of these are called Hinton 
diagrams.) We see that weight decay has dampened the weights in both 
layers: the resulting weights are spread fairly evenly over the ten hidden 
units. 


11.5.3 Scaling of the Inputs 

[Figure 11.4 panel titles: “Neural Network - 10 Units, No Weight Decay” (upper);
“Neural Network - 10 Units, Weight Decay=0.02” (lower)]

FIGURE 11.4. A neural network on the mixture example of Chapter 2. The
upper panel uses no weight decay, and overfits the training data. The lower panel
uses weight decay, and achieves close to the Bayes error rate (broken purple
boundary). Both use the softmax activation function and cross-entropy error.

FIGURE 11.5. Heat maps of the estimated weights from the training of neural
networks from Figure 11.4. The display ranges from bright green (negative) to
bright red (positive).

Since the scaling of the inputs determines the effective scaling of the weights
in the bottom layer, it can have a large effect on the quality of the final
solution. At the outset it is best to standardize all inputs to have mean zero
and standard deviation one. This ensures all inputs are treated equally in
the regularization process, and allows one to choose a meaningful range for
the random starting weights. With standardized inputs, it is typical to take
random uniform weights over the range $[-0.7, +0.7]$.
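
Standardizing the inputs and drawing starting weights can be sketched as follows. The sketch assumes the inputs are stored as a list of columns and uses the population standard deviation; the toy data, the seed, and the function names are hypothetical.

```python
import random
import statistics

def standardize(columns):
    """Scale each input column to mean zero and standard deviation one."""
    scaled = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.pstdev(col)      # population standard deviation
        scaled.append([(v - mean) / sd for v in col])
    return scaled

def initial_weights(n_weights, rng, limit=0.7):
    """Random uniform starting weights over [-limit, +limit]."""
    return [rng.uniform(-limit, limit) for _ in range(n_weights)]

rng = random.Random(0)                   # fixed seed, for reproducibility
columns = [[1.0, 2.0, 3.0, 4.0], [100.0, 300.0, 200.0, 400.0]]
scaled = standardize(columns)
start = initial_weights(6, rng)
print(scaled, start)
```

After scaling, the two toy inputs have the same spread even though their raw units differ by two orders of magnitude, so a single range for the starting weights and a single penalty parameter treat them alike.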

11.5.4 Number of Hidden Units and Layers 

Generally speaking it is better to have too many hidden units than too few.
With too few hidden units, the model might not have enough flexibility to
capture the nonlinearities in the data; with too many hidden units, the
extra weights can be shrunk toward zero if appropriate regularization is
used. Typically the number of hidden units is somewhere in the range of
5 to 100, with the number increasing with the number of inputs and
number of training cases. It is most common to put down a reasonably large
number of units and train them with regularization. Some researchers use
cross-validation to estimate the optimal number, but this seems
unnecessary if cross-validation is used to estimate the regularization parameter.
Choice of the number of hidden layers is guided by background knowledge
and experimentation. Each layer extracts features of the input for
regression or classification. Use of multiple hidden layers allows construction of
hierarchical features at different levels of resolution. An example of the
effective use of multiple layers is given in Section 11.6.

11.5.5 Multiple Minima 

The error function $R(\theta)$ is nonconvex, possessing many local minima. As a
result, the final solution obtained is quite dependent on the choice of
starting weights. One must at least try a number of random starting
configurations, and choose the solution giving lowest (penalized) error. Probably a
better approach is to use the average predictions over the collection of
networks as the final prediction (Ripley, 1996). This is preferable to averaging
the weights, since the nonlinearity of the model implies that this averaged
solution could be quite poor. Another approach is via bagging, which
averages the predictions of networks trained on randomly perturbed versions
of the training data. This is described in Section 8.7.


11.6 Example: Simulated Data 

We generated data from two additive error models Y = f(X) + e: 

Sum of sigmoids: Y = a(afX) + a{a^X) + £ i; 

10 

Radial: Y = q i>(X m ) +e 2 . 

m= 1 

Here X T = (Xi, X 2 , ■ ■ ■, X p ), each Xj being a standard Gaussian variate, 
with p = 2 in the first model, and p = 10 in the second. 

For the sigmoid model, a\ = (3,3), < 3,2 = (3,-3); for the radial model, 
4>{t) = (1/27I-) 1 / 2 exp(—f 2 /2). Both E\ and £2 are Gaussian errors, with 
variance chosen so that the signal-to-noise ratio 

Var(E(T|A)) _ Var(/(A)) 

Var(y - E(Y\X)) Var(s) 1 ' 1 

is 4 in both models. We took a training sample of size 100 and a test sample
of size 10,000. We fit neural networks with weight decay and various numbers
of hidden units, and recorded the average test error
$E_{\text{Test}}(Y - \hat f(X))^2$ for each of 10 random starting weights. Only one
training set was generated, but the results are typical for an “average”
training set. The test errors are shown in Figure 11.6. Note that the zero
hidden unit model refers to linear least squares regression. The neural
network is perfectly suited to the sum of sigmoids model, and the two-unit
model does perform the best, achieving an error close to the Bayes rate.
(Recall that the Bayes rate for regression with squared error is the error
variance; in the figures, we report test error relative to the Bayes error.)
Notice, however, that with more hidden units, overfitting quickly creeps in,
and with some starting weights the model does worse than the linear (zero
hidden unit) model. Even with two hidden units, two of the ten starting
weight configurations produced results no better than the linear model,
confirming the importance of multiple starting values.
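A minimal sketch of how a training sample from the sum of sigmoids model might be generated with a signal-to-noise ratio of 4. Calibrating the noise variance by a Monte Carlo estimate of Var(f(X)), the calibration sample size, and the random seed are all assumptions of this sketch; the text does not describe how the variance was actually set.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def f_sum_of_sigmoids(x, a1=(3.0, 3.0), a2=(3.0, -3.0)):
    """Noise-free signal f(X) = sigma(a1'X) + sigma(a2'X) for a 2-vector x."""
    return (sigmoid(a1[0] * x[0] + a1[1] * x[1])
            + sigmoid(a2[0] * x[0] + a2[1] * x[1]))

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def simulate(n, snr=4.0, n_calibrate=20000, seed=0):
    rng = random.Random(seed)
    # Estimate Var(f(X)) by Monte Carlo (an assumption of this sketch).
    calib = [f_sum_of_sigmoids((rng.gauss(0, 1), rng.gauss(0, 1)))
             for _ in range(n_calibrate)]
    noise_var = variance(calib) / snr      # Var(eps) = Var(f(X)) / SNR
    noise_sd = math.sqrt(noise_var)
    data = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        y = f_sum_of_sigmoids(x) + rng.gauss(0, noise_sd)
        data.append((x, y))
    return data, noise_var

train, noise_var = simulate(100)
print(len(train), noise_var)
```

The returned noise variance is also the Bayes error for this model, which is the baseline against which the figures report test error.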

FIGURE 11.6. Boxplots of test error, for simulated data example, relative to
the Bayes error (broken horizontal line). True function is a sum of two sigmoids
on the left, and a radial function is on the right. The test error is displayed for
10 different starting weights, for a single hidden layer neural network with the
number of units as indicated.

A radial function is in a sense the most difficult for the neural net, as it is
spherically symmetric and with no preferred directions. We see in the right
panel of Figure 11.6 that it does poorly in this case, with the test error
staying well above the Bayes error (note the different vertical scale from
the left panel). In fact, since a constant fit (such as the sample average)
achieves a relative error of 5 (when the SNR is 4), we see that the neural
networks perform increasingly worse than the mean.
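The relative error of 5 quoted for the constant fit follows from the definition of the signal-to-noise ratio. A short derivation, assuming the constant fit equals the population mean $E(Y)$ and ignoring the sampling variability of the sample average:

```latex
\frac{E\,(Y - E(Y))^2}{\mathrm{Var}(\varepsilon)}
  = \frac{\mathrm{Var}(f(X)) + \mathrm{Var}(\varepsilon)}{\mathrm{Var}(\varepsilon)}
  = \mathrm{SNR} + 1
  = 4 + 1
  = 5 .
```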

In this example we used a fixed weight decay parameter of 0.0005, representing
a mild amount of regularization. The results in the left panel of Figure 11.6
suggest that more regularization is needed with greater numbers of hidden
units.

In Figure 11.7 we repeated the experiment for the sum of sigmoids model,
with no weight decay in the left panel, and stronger weight decay ($\lambda = 0.1$)
in the right panel. With no weight decay, overfitting becomes even more
severe for larger numbers of hidden units. The weight decay value $\lambda = 0.1$
produces good results for all numbers of hidden units, and there does not
appear to be overfitting as the number of units increases. Finally, Figure 11.8
shows the test error for a ten hidden unit network, varying the weight decay
parameter over a wide range. The value 0.1 is approximately optimal.

FIGURE 11.7. Boxplots of test error, for simulated data example, relative to the
Bayes error. True function is a sum of two sigmoids. The test error is displayed
for ten different starting weights, for a single hidden layer neural network with
the number of units as indicated. The two panels represent no weight decay (left)
and strong weight decay $\lambda = 0.1$ (right).

FIGURE 11.8. Boxplots of test error, for simulated data example. True function
is a sum of two sigmoids. The test error is displayed for ten different starting
weights, for a single hidden layer neural network with ten hidden units and weight
decay parameter value as indicated. (Panel title: Sum of Sigmoids, 10 Hidden
Unit Model.)

FIGURE 11.9. Examples of training cases from ZIP code data. Each image is
a 16 x 16 8-bit grayscale representation of a handwritten digit.

In summary, there are two free parameters to select: the weight decay $\lambda$
and number of hidden units $M$. As a learning strategy, one could fix either
parameter at the value corresponding to the least constrained model, to
ensure that the model is rich enough, and use cross-validation to choose
the other parameter. Here the least constrained values are zero weight decay
and ten hidden units. Comparing the left panel of Figure 11.7 to Figure 11.8,
we see that the test error is less sensitive to the value of the weight
decay parameter, and hence cross-validation of this parameter would be
preferred.


11.7 Example: ZIP Code Data 

This example is a character recognition task: classification of handwritten
numerals. This problem captured the attention of the machine learning and
neural network community for many years, and has remained a benchmark
problem in the field. Figure 11.9 shows some examples of normalized
handwritten digits, automatically scanned from envelopes by the U.S. Postal
Service. The original scanned digits are binary and of different sizes and
orientations; the images shown here have been deslanted and size normalized,
resulting in 16 x 16 grayscale images (Le Cun et al., 1990). These 256
pixel values are used as inputs to the neural network classifier.

A black box neural network is not ideally suited to this pattern recognition
task, partly because the pixel representation of the images lacks certain
invariances (such as small rotations of the image). Consequently early
attempts with neural networks yielded misclassification rates around 4.5%
on various examples of the problem. In this section we show some of the
pioneering efforts to handcraft the neural network to overcome some of these
deficiencies (Le Cun, 1989), which ultimately led to the state of the art in
neural network performance (Le Cun et al., 1998)¹.

Although current digit datasets have tens of thousands of training and
test examples, the sample size here is deliberately modest in order to
emphasize the effects. The examples were obtained by scanning some actual
hand-drawn digits, and then generating additional images by random
horizontal shifts. Details may be found in Le Cun (1989). There are 320 digits
in the training set, and 160 in the test set.

¹ The figures and tables in this example were recreated from Le Cun (1989).

FIGURE 11.10. Architecture of the five networks used in the ZIP code example.

Five different networks were fit to the data: 

Net-1: No hidden layer, equivalent to multinomial logistic regression. 

Net-2: One hidden layer, 12 hidden units fully connected. 

Net-3: Two hidden layers locally connected. 

Net-4: Two hidden layers, locally connected with weight sharing. 

Net-5: Two hidden layers, locally connected, two levels of weight sharing. 

These are depicted in Figure 11.10. Net-1 for example has 256 inputs, one
each for the 16 x 16 input pixels, and ten output units for each of the digits
0-9. The predicted value $\hat f_k(x)$ represents the estimated probability that
an image $x$ has digit class $k$, for $k = 0, 1, 2, \ldots, 9$.

FIGURE 11.11. Test performance curves, as a function of the number of training
epochs, for the five networks of Table 11.1 applied to the ZIP code data
(Le Cun, 1989).

The networks all have sigmoidal output units, and were all fit with the
sum-of-squares error function. The first network has no hidden layer, and
hence is nearly equivalent to a linear multinomial regression model
(Exercise 11.4). Net-2 is a single hidden layer network with 12 hidden units, of
the kind described above.

The training set error for all of the networks was 0%, since in all cases 
there are more parameters than training observations. The evolution of the 
test error during the training epochs is shown in Figure 11.11. The linear 
network (Net-1) starts to overfit fairly quickly, while test performance of
the others levels off at successively superior values.

The other three networks have additional features which demonstrate 
the power and flexibility of the neural network paradigm. They introduce 
constraints on the network, natural for the problem at hand, which allow 
for more complex connectivity but fewer parameters. 

Net-3 uses local connectivity: this means that each hidden unit is connected
to only a small patch of units in the layer below. In the first hidden
layer (an 8 x 8 array), each unit takes inputs from a 3 x 3 patch of the input
layer; for units in the first hidden layer that are one unit apart, their
receptive fields overlap by one row or column, and hence are two pixels apart.
In the second hidden layer, inputs are from a 5 x 5 patch, and again units
that are one unit apart have receptive fields that are two units apart. The
weights for all other connections are set to zero. Local connectivity makes
each unit responsible for extracting local features from the layer below, and
reduces considerably the total number of weights. With many more hidden
units than Net-2, Net-3 has fewer links and hence weights (1226 vs. 3214),
and achieves similar performance.

TABLE 11.1. Test set performance of five different neural networks on a
handwritten digit classification example (Le Cun, 1989).

         Network Architecture     Links   Weights   % Correct
Net-1    Single layer network      2570      2570       80.0%
Net-2    Two layer network         3214      3214       87.0%
Net-3    Locally connected         1226      1226       88.5%
Net-4    Constrained network 1     2266      1132       94.0%
Net-5    Constrained network 2     5194      1060       98.4%


Net-4 and Net-5 have local connectivity with shared weights. All units
in a local feature map perform the same operation on different parts of the
image, achieved by sharing the same weights. The first hidden layer of Net-4
has two 8 x 8 arrays, and each unit takes input from a 3 x 3 patch just like
in Net-3. However, each of the units in a single 8 x 8 feature map shares the
same set of nine weights (but has its own bias parameter). This forces
the extracted features in different parts of the image to be computed by
the same linear functional, and consequently these networks are sometimes
known as convolutional networks. The second hidden layer of Net-4 has
no weight sharing, and is the same as in Net-3. The gradient of the error
function $R$ with respect to a shared weight is the sum of the gradients of
$R$ with respect to each connection controlled by the weights in question.
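The gradient rule for shared weights can be checked numerically on a toy case. The sketch below is a simplification and not the networks of this example: it uses a one-dimensional input, a single shared kernel of three weights, linear units without biases, and a sum-of-squares error. All inputs, weights and targets are hypothetical.

```python
def forward(w, x):
    """1-D 'feature map': z_k = sum_j w[j] * x[k + j], one shared kernel w."""
    n_out = len(x) - len(w) + 1
    return [sum(w[j] * x[k + j] for j in range(len(w))) for k in range(n_out)]

def loss(w, x, t):
    """Sum-of-squares error R between the feature map and targets t."""
    return sum((z - tk) ** 2 for z, tk in zip(forward(w, x), t))

def shared_weight_gradient(w, x, t):
    """dR/dw[j] as the sum, over units k, of the gradient for connection (k, j)."""
    z = forward(w, x)
    grad = [0.0] * len(w)
    for k, (zk, tk) in enumerate(zip(z, t)):
        for j in range(len(w)):
            # Gradient of R w.r.t. the single connection from x[k + j] to unit k,
            # as if that connection had its own untied weight.
            per_connection = 2.0 * (zk - tk) * x[k + j]
            grad[j] += per_connection
    return grad

def numerical_gradient(w, x, t, eps=1e-6):
    """Central finite differences on the shared weights, for checking."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += eps
        w_minus = list(w)
        w_minus[j] -= eps
        grad.append((loss(w_plus, x, t) - loss(w_minus, x, t)) / (2 * eps))
    return grad

# Hypothetical toy inputs, kernel and targets.
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
w = [0.2, -0.4, 0.1]
t = [1.0, 0.0, -1.0, 0.5]

analytic = shared_weight_gradient(w, x, t)
numeric = numerical_gradient(w, x, t)
print(analytic, numeric)
```

The summed per-connection gradients agree with the finite-difference gradient of the error with respect to each shared weight.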

Table 11.1 gives the number of links, the number of weights and the 
optimal test performance for each of the networks. We see that Net-4 has 
more links but fewer weights than Net-3, and superior test performance. 
Net-5 has four 4 x 4 feature maps in the second hidden layer, each unit
connected to a 5 x 5 local patch in the layer below. Weights are shared
in each of these feature maps. We see that Net-5 does the best, having
errors of only 1.6%, compared to 13% for the “vanilla” network Net-2.
The clever design of network Net-5, motivated by the fact that features of
handwriting style should appear in more than one part of a digit, was the
result of many person-years of experimentation. This and similar networks
gave better performance on ZIP code problems than any other learning 
method at that time (early 1990s). This example also shows that neural 
networks are not a fully automatic tool, as they are sometimes advertised. 
As with all statistical models, subject matter knowledge can and should be 
used to improve their performance. 
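The link and weight counts in Table 11.1 can be reproduced by simple bookkeeping. The sketch below makes several assumptions that the text does not state in full: every unit has one bias, which is counted as both a link and a weight; the second hidden layer of Net-3 and Net-4 is a single 4 x 4 array; and each second-layer unit sees a 5 x 5 patch in every first-layer feature map. The helper names are hypothetical.

```python
def fully_connected(n_in, n_out):
    """Links and free weights of a dense layer, counting one bias per unit."""
    links = n_out * (n_in + 1)
    return links, links

def locally_connected(n_maps_in, patch, n_units):
    """Each unit sees a patch x patch window in every input map, plus a bias."""
    links = n_units * (n_maps_in * patch * patch + 1)
    return links, links

def shared(n_maps_in, patch, n_maps_out, units_per_map):
    """Units in one output map share patch weights but keep their own bias."""
    fan_in = n_maps_in * patch * patch
    n_units = n_maps_out * units_per_map
    links = n_units * (fan_in + 1)
    weights = n_maps_out * fan_in + n_units
    return links, weights

def total(layers):
    """Sum (links, weights) pairs over the layers of a network."""
    return tuple(sum(col) for col in zip(*layers))

nets = {
    "Net-1": total([fully_connected(256, 10)]),
    "Net-2": total([fully_connected(256, 12), fully_connected(12, 10)]),
    "Net-3": total([locally_connected(1, 3, 64), locally_connected(1, 5, 16),
                    fully_connected(16, 10)]),
    "Net-4": total([shared(1, 3, 2, 64), locally_connected(2, 5, 16),
                    fully_connected(16, 10)]),
    "Net-5": total([shared(1, 3, 2, 64), shared(2, 5, 4, 16),
                    fully_connected(64, 10)]),
}
for name, (links, weights) in nets.items():
    print(name, links, weights)
```

Under these assumptions the counts match the table, which illustrates how weight sharing lets Net-5 have the most links but the fewest free weights.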

This network was later outperformed by the tangent distance approach
(Simard et al., 1993) described in Section 13.3.3, which explicitly
incorporates natural affine invariances. At this point the digit recognition
datasets became test beds for every new learning procedure, and researchers
worked hard to drive down the error rates. As of this writing, the best error
rates on a large database (60,000 training, 10,000 test observations), derived
from standard NIST² databases, were reported to be the following (Le Cun et
al., 1998):

• 1.1% for tangent distance with a 1-nearest neighbor classifier
(Section 13.3.3);

• 0.8% for a degree-9 polynomial SVM (Section 12.3);

• 0.8% for LeNet-5, a more complex version of the convolutional network
described here;

• 0.7% for boosted LeNet-4. Boosting is described in Chapter 8. LeNet-4
is a predecessor of LeNet-5.

Le Cun et al. (1998) report a much larger table of performance results, and
it is evident that many groups have been working very hard to bring these
test error rates down. They report a standard error of 0.1% on the error
estimates, which is based on a binomial average with $N$ = 10,000 and
$p \approx 0.01$. This implies that error rates within 0.1-0.2% of one another
are statistically equivalent. Realistically the standard error is even higher,
since the test data has been implicitly used in the tuning of the various
procedures.
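The quoted standard error can be checked from the binomial formula $\sqrt{p(1-p)/N}$. A minimal sketch, with a hypothetical helper name:

```python
import math

def binomial_standard_error(p, n):
    """Standard error of an error rate estimated from n independent test cases."""
    return math.sqrt(p * (1.0 - p) / n)

se = binomial_standard_error(p=0.01, n=10_000)
print(f"standard error = {100 * se:.3f}%")
```

With these values the result is very close to 0.1%, in line with the figure reported above.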


11.8 Discussion 

Both projection pursuit regression and neural networks take nonlinear
functions of linear combinations (“derived features”) of the inputs. This is a
powerful and very general approach for regression and classification, and
has been shown to compete well with the best learning methods on many
problems.

These tools are especially effective in problems with a high signal-to-noise
ratio and settings where prediction without interpretation is the goal. They
are less effective for problems where the goal is to describe the physical
process that generated the data and the roles of individual inputs. Each input
enters into the model in many places, in a nonlinear fashion. Some authors
(Hinton, 1989) plot a diagram of the estimated weights into each hidden
unit, to try to understand the feature that each unit is extracting. This
is limited however by the lack of identifiability of the parameter vectors
$\alpha_m$, $m = 1, \ldots, M$. Often there are solutions with $\alpha_m$ spanning the same
linear space as the ones found during training, giving predicted values that
are roughly the same. Some authors suggest carrying out a principal
component analysis of these weights, to try to find an interpretable solution.
In general, the difficulty of interpreting these models has limited their use
in fields like medicine, where interpretation of the model is very important.

² The National Institute of Standards and Technology maintains large
databases, including handwritten character databases; http://www.nist.gov/srd/.

There has been a great deal of research on the training of neural networks.
Unlike methods like CART and MARS, neural networks are smooth
functions of real-valued parameters. This facilitates the development of
Bayesian inference for these models. The next section discusses a
successful Bayesian implementation of neural networks.


11.9 Bayesian Neural Nets and the NIPS 2003 Challenge

A classification competition was held in 2003, in which five labeled
training datasets were provided to participants. It was organized for a Neural
Information Processing Systems (NIPS) workshop. Each of the data sets
constituted a two-class classification problem, with different sizes and from
a variety of domains (see Table 11.2). Feature measurements for a
validation dataset were also available.

Participants developed and applied statistical learning procedures to
make predictions on the datasets, and could submit predictions on the
validation set to a website for a period of 12 weeks. With this feedback,
participants were then asked to submit predictions for a separate test set
and they received their results. Finally, the class labels for the validation
set were released and participants had one week to train their algorithms
on the combined training and validation sets, and submit their final
predictions to the competition website. A total of 75 groups participated, with
20 and 16 eventually making submissions on the validation and test sets,
respectively.

There was an emphasis on feature extraction in the competition.
Artificial “probes” were added to the data: these are noise features with
distributions resembling the real features but independent of the class labels.
The percentage of probes that were added to each dataset, relative to the
total set of features, is shown in Table 11.2. Thus each learning algorithm
had to figure out a way of identifying the probes and downweighting or
eliminating them.

A number of metrics were used to evaluate the entries, including the
percentage correct on the test set, the area under the ROC curve, and a
combined score that compared each pair of classifiers head-to-head. The
results of the competition are very interesting and are detailed in Guyon et
al. (2006). The most notable result: the entries of Neal and Zhang (2006)
were the clear overall winners. In the final competition they finished first
in three of the five datasets, and were 5th and 7th on the remaining two
datasets.

TABLE 11.2. NIPS 2003 challenge data sets. The column labeled p is the number
of features. For the Dorothea dataset the features are binary. N_tr, N_val and
N_te are the number of training, validation and test cases, respectively.

Dataset    Domain                Feature Type   p         Percent Probes   N_tr   N_val   N_te
Arcene     Mass spectrometry     Dense          10,000    30               100    100     700
Dexter     Text classification   Sparse         20,000    50               300    300     2000
Dorothea   Drug discovery        Sparse         100,000   50               800    350     800
Gisette    Digit recognition     Dense          5000      30               6000   1000    6500
Madelon    Artificial            Dense          500       96               2000   600     1800

In their winning entries, Neal and Zhang (2006) used a series of
pre-processing feature-selection steps, followed by Bayesian neural networks,
Dirichlet diffusion trees, and combinations of these methods. Here we focus
only on the Bayesian neural network approach, and try to discern which
aspects of their approach were important for its success. We rerun their
programs and compare the results to boosted neural networks and boosted
trees, and other related methods.


11.9.1 Bayes, Boosting and Bagging 

Let us first review briefly the Bayesian approach to inference and its
application to neural networks. Given training data $\mathbf{X}_{tr}, \mathbf{y}_{tr}$, we assume a
sampling model with parameters $\theta$; Neal and Zhang (2006) use a
two-hidden-layer neural network, with output nodes the class probabilities
$\Pr(Y|X, \theta)$ for the binary outcomes. Given a prior distribution $\Pr(\theta)$, the
posterior distribution for the parameters is

$$\Pr(\theta|\mathbf{X}_{tr}, \mathbf{y}_{tr}) = \frac{\Pr(\theta)\Pr(\mathbf{y}_{tr}|\mathbf{X}_{tr}, \theta)}{\int \Pr(\theta)\Pr(\mathbf{y}_{tr}|\mathbf{X}_{tr}, \theta)\,d\theta}. \qquad (11.19)$$

For a test case with features $X_{new}$, the predictive distribution for the
label $Y_{new}$ is

$$\Pr(Y_{new}|X_{new}, \mathbf{X}_{tr}, \mathbf{y}_{tr}) = \int \Pr(Y_{new}|X_{new}, \theta)\Pr(\theta|\mathbf{X}_{tr}, \mathbf{y}_{tr})\,d\theta \qquad (11.20)$$

(cf. equation 8.24). Since the integral in (11.20) is intractable, sophisticated
Markov chain Monte Carlo (MCMC) methods are used to sample from the
posterior distribution $\Pr(\theta|\mathbf{X}_{tr}, \mathbf{y}_{tr})$. A few hundred values $\theta$ are
generated and then a simple average of $\Pr(Y_{new}|X_{new}, \theta)$ over these values
estimates the integral. Neal and Zhang (2006) use diffuse Gaussian priors for
all of the parameters. The particular MCMC approach that was used is called
hybrid Monte Carlo, and may be important for the success of the method. It
includes an auxiliary momentum vector and implements Hamiltonian dynamics in
which the potential function is the negative logarithm of the target density.
This is done to avoid random walk behavior; the successive candidates move
across the sample space in larger steps. They tend to be less correlated and
hence converge to the target distribution more rapidly.
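The mechanics of hybrid Monte Carlo can be sketched on a one-dimensional toy target. This is not the implementation used by Neal and Zhang: the standard normal target, the leapfrog integrator settings (step size 0.2, ten steps), the seed and the function names are all assumptions made for illustration.

```python
import math
import random

def potential(q):
    """U(q) = -log target density (up to a constant); here a standard normal."""
    return 0.5 * q * q

def grad_potential(q):
    return q

def hmc_step(q, rng, step_size=0.2, n_leapfrog=10):
    """One hybrid Monte Carlo update: draw a momentum, simulate, accept or reject."""
    p = rng.gauss(0.0, 1.0)                      # auxiliary momentum
    h_old = potential(q) + 0.5 * p * p           # total energy before the move
    q_new, p_new = q, p
    # Leapfrog integration of the Hamiltonian dynamics.
    p_new -= 0.5 * step_size * grad_potential(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_potential(q_new)
    p_new -= 0.5 * step_size * grad_potential(q_new)
    h_new = potential(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for the discretization error of the integrator.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q

rng = random.Random(1)
q, draws = 0.0, []
for _ in range(2000):
    q = hmc_step(q, rng)
    draws.append(q)

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```

Each update travels a long way along a simulated trajectory before the accept/reject step, which is what lets successive draws be far apart rather than forming a slow random walk.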

Neal and Zhang (2006) also tried different forms of pre-processing of the 
features: 

1. univariate screening using t-tests, and

2. automatic relevance determination.

In the latter method (ARD), the weights (coefficients) for the jth feature
to each of the first hidden layer units all share a common prior variance
$\sigma_j^2$, and prior mean zero. The posterior distributions for each variance
$\sigma_j^2$ are computed, and the features whose posterior variance concentrates
on small values are discarded.

There are thus three main features of this approach that could be
important for its success:

(a) the feature selection and pre-processing, 

(b) the neural network model, and 

(c) the Bayesian inference for the model using MCMC. 

According to Neal and Zhang (2006), feature screening in (a) is carried 
out purely for computational efficiency; the MCMC procedure is slow with 
a large number of features. There is no need to use feature selection to avoid 
overfitting. The posterior average (11.20) takes care of this automatically. 

We would like to understand the reasons for the success of the Bayesian 
method. In our view, the power of modern Bayesian methods does not lie in
their use as a formal inference procedure; most people would not believe 
that the priors in a high-dimensional, complex neural network model are 
actually correct. Rather the Bayesian/MCMC approach gives an efficient 
way of sampling the relevant parts of model space, and then averaging the 
predictions for the high-probability models. 

Bagging and boosting are non-Bayesian procedures that have some similarity
to MCMC in a Bayesian model. The Bayesian approach fixes the data
and perturbs the parameters, according to the current estimate of the
posterior distribution. Bagging perturbs the data in an i.i.d. fashion and then
re-estimates the model to give a new set of model parameters. At the end,
a simple average of the model predictions from different bagged samples is
computed. Boosting is similar to bagging, but fits a model that is additive
in the models of each individual base learner, which are learned using
non-i.i.d. samples. We can write all of these models in the form

$$\hat f(x_{new}) = \sum_{\ell=1}^{L} w_\ell \, E(Y_{new} \mid x_{new}, \hat\theta_\ell). \qquad (11.21)$$

In all cases the $\hat\theta_\ell$ are a large collection of model parameters. For the
Bayesian model the $w_\ell = 1/L$, and the average estimates the posterior
mean (11.21) by sampling $\theta_\ell$ from the posterior distribution. For bagging,
$w_\ell = 1/L$ as well, and the $\hat\theta_\ell$ are the parameters refit to bootstrap
resamples of the training data. For boosting, the weights are all equal to
1, but the $\hat\theta_\ell$ are typically chosen in a nonrandom sequential fashion to
constantly improve the fit.
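The common form (11.21) can be written as a single weighted sum in which only the weights and the way the member models are obtained differ. A minimal sketch for one test point; the function names and all numerical values are hypothetical.

```python
def combine(member_predictions, weights):
    """f(x_new) = sum_l w_l * E(Y_new | x_new, theta_l) for one test point."""
    if len(member_predictions) != len(weights):
        raise ValueError("need one weight per model")
    return sum(w * pred for w, pred in zip(weights, member_predictions))

def averaging_weights(n_models):
    """w_l = 1/L, as for posterior averaging and for bagging."""
    return [1.0 / n_models] * n_models

def boosting_weights(n_models):
    """w_l = 1, as for boosting, where each term is an additive correction."""
    return [1.0] * n_models

# Hypothetical predictions E(Y_new | x_new, theta_l) from L = 4 fitted models.
preds = [0.70, 0.80, 0.60, 0.90]
averaged = combine(preds, averaging_weights(len(preds)))

# Hypothetical boosting terms: each base learner contributes an increment.
increments = [0.50, 0.20, 0.05, -0.02]
boosted = combine(increments, boosting_weights(len(increments)))

print(averaged, boosted)
```

The combination step is the same in all three cases; the methods differ in how the individual parameter sets are generated (posterior sampling, bootstrap refits, or sequential fitting).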


11.9.2 Performance Comparisons 

Based on the similarities above, we decided to compare Bayesian neural 
networks to boosted trees, boosted neural networks, random forests and 
bagged neural networks on the five datasets in Table 11.2. Bagging and 
boosting of neural networks are not methods that we have previously used 
in our work. We decided to try them here, because of the success of Bayesian 
neural networks in this competition, and the good performance of bagging 
and boosting with trees. We also felt that by bagging and boosting neural 
nets, we could assess both the choice of model as well as the model search 
strategy. 

Here are the details of the learning methods that were compared: 

Bayesian neural nets. The results here are taken from Neal and Zhang 
(2006), using their Bayesian approach to fitting neural networks. The 
models had two hidden layers of 20 and 8 units. We re-ran some 
networks for timing purposes only. 

Boosted trees. We used the gbm package (version 1.5-7) in the R language. 
Tree depth and shrinkage factors varied from dataset to dataset. We 
consistently bagged 80% of the data at each boosting iteration (the 
default is 50%). Shrinkage was between 0.001 and 0.1. Tree depth was 
between 2 and 9. 

Boosted neural networks. Since boosting is typically most effective with 
“weak” learners, we boosted a single hidden layer neural network with 
two or four units, fit with the nnet package (version 7.2-36) in R. 


Random forests. We used the R package randomForest (version 4.5-16) 
with default settings for the parameters. 

Bagged neural networks. We used the same architecture as in the Bayesian 
neural network above (two hidden layers of 20 and 8 units), fit using 
both Neal’s C language package “Flexible Bayesian Modeling” (2004- 
11-10 release), and Matlab neural-net toolbox (version 5.1).

FIGURE 11.12. Performance of different learning methods on five problems,
using both univariate screening of features (top panel, “Univariate Screened
Features”) and a reduced feature set from automatic relevance determination
(bottom panel, “ARD Reduced Features”). The error bars at the top of each plot
have width equal to one standard error of the difference between two error
rates. On most of the problems several competitors are within this error bound.


This analysis was carried out by Nicholas Johnson, and full details may
be found in Johnson (2008)³. The results are shown in Figure 11.12 and
Table 11.3.

The figure and table show Bayesian, boosted and bagged neural networks,
boosted trees, and random forests, using both the screened and reduced
feature sets. The error bars at the top of each plot indicate one standard
error of the difference between two error rates. Bayesian neural networks
again emerge as the winner, although for some datasets the differences
between the test error rates are not statistically significant. Random forests
perform the best among the competitors using the screened feature set,
while the boosted neural networks perform best with the reduced feature
set, and nearly match the Bayesian neural net.

The superiority of boosted neural networks over boosted trees suggests
that the neural network model is better suited to these particular problems.
Specifically, individual features might not be good predictors here
and linear combinations of features work better. However the impressive
performance of random forests is at odds with this explanation, and came
as a surprise to us.

³ We also thank Isabelle Guyon for help in preparing the results of this section.

TABLE 11.3. Performance of different methods. Values are average rank of test
error across the five problems (low is good), and mean computation time and
standard error of the mean, in minutes.

                            Screened Features        ARD Reduced Features
Method                      Average    Average       Average    Average
                            Rank       Time          Rank       Time
Bayesian neural networks    1.5        384 (138)     1.6        600 (186)
Boosted trees               3.4        3.03 (2.5)    4.0        34.1 (32.4)
Boosted neural networks     3.8        9.4 (8.6)     2.2        35.6 (33.5)
Random forests              2.7        1.9 (1.7)     3.2        11.2 (9.3)
Bagged neural networks      3.6        3.5 (1.1)     4.0        6.4 (4.4)

Since the reduced feature sets come from the Bayesian neural network 
approach, only the methods that use the screened features are legitimate, 
self-contained procedures. However, this does suggest that better methods 
for internal feature selection might help the overall performance of boosted 
neural networks. 

The table also shows the approximate training time required for each 
method. Here the non-Bayesian methods show a clear advantage. 

Overall, the superior performance of Bayesian neural networks here may 
be due to the fact that 

(a) the neural network model is well suited to these five problems, and 

(b) the MCMC approach provides an efficient way of exploring the
important part of the parameter space, and then averaging the resulting
models according to their quality.

The Bayesian approach works well for smoothly parametrized models like 
neural nets; it is not yet clear that it works as well for non-smooth models 
like trees. 


11.10 Computational Considerations 

With N observations, p predictors, M hidden units and L training epochs, a 
neural network fit typically requires O(NpML) operations. There are many 
packages available for fitting neural networks, probably many more than 
exist for mainstream statistical methods. Because the available software 
varies widely in quality, and the learning problem for neural networks is 
sensitive to issues such as input scaling, such software should be carefully 
chosen and tested.


Bibliographic Notes

Projection pursuit was proposed by Friedman and Tukey (1974), and
specialized to regression by Friedman and Stuetzle (1981). Huber (1985) gives
a scholarly overview, and Roosen and Hastie (1994) present a formulation
using smoothing splines. The motivation for neural networks dates back
to McCulloch and Pitts (1943), Widrow and Hoff (1960) (reprinted in
Anderson and Rosenfeld (1988)) and Rosenblatt (1962). Hebb (1949) heavily
influenced the development of learning algorithms. The resurgence of neural 
networks in the mid 1980s was due to Werbos (1974), Parker (1985) and 
Rumelhart et al. (1986), who proposed the back-propagation algorithm. 
Today there are many books written on the topic, for a broad range of 
audiences. For readers of this book, Hertz et al. (1991), Bishop (1995) and 
Ripley (1996) may be the most informative. Bayesian learning for neural 
networks is described in Neal (1996). The ZIP code example was taken from 
Le Cun (1989); see also Le Cun et al. (1990) and Le Cun et al. (1998). 

We do not discuss theoretical topics such as the approximation properties of
neural networks; see, for example, the work of Barron (1993), Girosi et al.
(1995) and Jones (1992). Some of these results are summarized by Ripley (1996).


Exercises 


Ex. 11.1 Establish the exact correspondence between the projection pursuit
regression model (11.1) and the neural network (11.5). In particular,
show that the single-layer regression network is equivalent to a PPR model
with $g_m(\omega_m^T x) = \beta_m \sigma(\alpha_{0m} + s_m(\omega_m^T x))$, where $\omega_m$ is the $m$th unit vector.
Establish a similar equivalence for a classification network.

Ex. 11.2 Consider a neural network for a quantitative outcome as in (11.5),
using squared-error loss and identity output function $g_k(t) = t$. Suppose
that the weights $\alpha_m$ from the input to hidden layer are nearly zero. Show
that the resulting model is nearly linear in the inputs.

Ex. 11.3 Derive the forward and backward propagation equations for the 
cross-entropy loss function. 

Ex. 11.4 Consider a neural network for a K class outcome that uses
cross-entropy loss. If the network has no hidden layer, show that the model is
equivalent to the multinomial logistic model described in Chapter 4.

Ex. 11.5 

(a) Write a program to fit a single hidden layer neural network (ten hidden
units) via back-propagation and weight decay.

(b) Apply it to 100 observations from the model

$$Y = \sigma(a_1^T X) + (a_2^T X)^2 + 0.30 \cdot Z,$$

where $\sigma$ is the sigmoid function, $Z$ is standard normal, $X^T = (X_1, X_2)$,
each $X_j$ being independent standard normal, and $a_1 = (3, 3)$, $a_2 = (3, -3)$.
Generate a test sample of size 1000, and plot the training and
test error curves as a function of the number of training epochs, for
different values of the weight decay parameter. Discuss the overfitting
behavior in each case.

(c) Vary the number of hidden units in the network, from 1 up to 10, and
determine the minimum number needed to perform well for this task.

Ex. 11.6 Write a program to carry out projection pursuit regression, using 
cubic smoothing splines with fixed degrees of freedom. Fit it to the data 
from the previous exercise, for various values of the smoothing parameter 
and number of model terms. Find the minimum number of model terms 
necessary for the model to perform well and compare this to the number 
of hidden units from the previous exercise. 

Ex. 11.7 Fit a neural network to the spam data of Section 9.1.2, and compare 
the results to those for the additive model given in that chapter. Compare 
both the classification performance and interpretability of the final model. 
Support Vector Machines and
Flexible Discriminants 


12.1 Introduction 

In this chapter we describe generalizations of linear decision boundaries
for classification. Optimal separating hyperplanes are introduced in Chapter 4
for the case when two classes are linearly separable. Here we cover
extensions to the nonseparable case, where the classes overlap. These
techniques are then generalized to what is known as the support vector machine,
which produces nonlinear boundaries by constructing a linear boundary in
a large, transformed version of the feature space. The second set of methods
generalizes Fisher’s linear discriminant analysis (LDA). The generalizations
include flexible discriminant analysis, which facilitates construction of
nonlinear boundaries in a manner very similar to the support vector machines;
penalized discriminant analysis, for problems such as signal and image
classification where the large number of features are highly correlated; and
mixture discriminant analysis, for irregularly shaped classes.


12.2 The Support Vector Classifier 

In Chapter 4 we discussed a technique for constructing an optimal separating
hyperplane between two perfectly separated classes. We review this and
generalize to the nonseparable case, where the classes may not be separable
by a linear boundary.
FIGURE 12.1. Support vector classifiers. The left panel shows the separable
case. The decision boundary is the solid line, while broken lines bound the shaded
maximal margin of width $2M = 2/\|\beta\|$. The right panel shows the nonseparable
(overlap) case. The points labeled $\xi_j^*$ are on the wrong side of their margin by
an amount $\xi_j^* = M\xi_j$; points on the correct side have $\xi_j^* = 0$. The margin is
maximized subject to a total budget $\sum \xi_i \le$ constant. Hence $\sum \xi_j^*$ is the total
distance of points on the wrong side of their margin.

Our training data consists of N pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, with
$x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. Define a hyperplane by

$$\{x : f(x) = x^T\beta + \beta_0 = 0\}, \tag{12.1}$$

where $\beta$ is a unit vector: $\|\beta\| = 1$. A classification rule induced by $f(x)$ is

$$G(x) = \mathrm{sign}[x^T\beta + \beta_0]. \tag{12.2}$$

The geometry of hyperplanes is reviewed in Section 4.5, where we show that
$f(x)$ in (12.1) gives the signed distance from a point $x$ to the hyperplane
$f(x) = x^T\beta + \beta_0 = 0$. Since the classes are separable, we can find a function
$f(x) = x^T\beta + \beta_0$ with $y_i f(x_i) > 0 \ \forall i$. Hence we are able to find the
hyperplane that creates the biggest margin between the training points for
class 1 and −1 (see Figure 12.1). The optimization problem

$$\max_{\beta,\, \beta_0,\, \|\beta\| = 1} M \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge M, \ i = 1, \ldots, N, \tag{12.3}$$

captures this concept. The band in the figure is $M$ units away from the
hyperplane on either side, and hence $2M$ units wide. It is called the margin.
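As a small illustration of the quantities in (12.1)-(12.3), the following sketch computes signed distances and the margin $M$ achieved by a given hyperplane. The toy data and the function names are our own assumptions, and the hyperplane is simply supplied by hand rather than optimized.

```python
import math

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane x^T beta + beta0 = 0.

    beta is rescaled to unit length, matching the constraint ||beta|| = 1.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * v for b, v in zip(beta, x)) + beta0) / norm

def margin(X, y, beta, beta0):
    """Smallest y_i * (signed distance); positive only if all points are separated."""
    return min(yi * signed_distance(xi, beta, beta0) for xi, yi in zip(X, y))

# Hypothetical separable toy data with labels in {-1, +1}.
X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
y = [1, 1, -1, -1]
beta, beta0 = (1.0, 1.0), 0.0

M = margin(X, y, beta, beta0)
print("margin M =", M, "band width 2M =", 2 * M)
```

Problem (12.3) searches over all hyperplanes for the one that makes this value of $M$ as large as possible; the sketch only evaluates one candidate.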

We showed that this problem can be more conveniently rephrased as 

$$\min_{\beta,\, \beta_0} \|\beta\| \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, \ i = 1, \ldots, N, \tag{12.4}$$

where we have dropped the norm constraint on $\beta$. Note that $M = 1/\|\beta\|$.
Expression (12.4) is the usual way of writing the support vector criterion
for separated data. This is a convex optimization problem (quadratic
criterion, linear inequality constraints), and the solution is characterized in
Section 4.5.2.

Suppose now that the classes overlap in feature space. One way to deal
with the overlap is to still maximize $M$, but allow for some points to be on
the wrong side of the margin. Define the slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$.
There are two natural ways to modify the constraint in (12.3):

$$y_i(x_i^T\beta + \beta_0) \ge M - \xi_i, \tag{12.5}$$

or

$$y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i), \tag{12.6}$$

$\forall i$, $\xi_i \ge 0$, $\sum_{i=1}^N \xi_i \le$ constant. The two choices lead to different solutions.

The first choice seems more natural, since it measures overlap in actual 
distance from the margin; the second choice measures the overlap in relative 
distance, which changes with the width of the margin M. However, the first 
choice results in a nonconvex optimization problem, while the second is 
convex; thus (12.6) leads to the “standard” support vector classifier, which 
we use from here on. 

Here is the idea of the formulation. The value $\xi_i$ in the constraint
$y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i)$ is the proportional amount by which the prediction
$f(x_i) = x_i^T\beta + \beta_0$ is on the wrong side of its margin. Hence by bounding the
sum $\sum \xi_i$, we bound the total proportional amount by which predictions
fall on the wrong side of their margin. Misclassifications occur when $\xi_i > 1$,
so bounding $\sum \xi_i$ at a value $K$ say, bounds the total number of training
misclassifications at $K$.
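The bound on misclassifications can be checked numerically. The sketch below assumes a unit-norm $\beta$ and a fixed margin $M$, takes the smallest slacks that satisfy the constraint in (12.6), and compares their sum with the number of misclassified points. The one-dimensional data are hypothetical.

```python
def slacks(X, y, beta, beta0, M):
    """Smallest xi_i >= 0 with y_i f(x_i) >= M (1 - xi_i), assuming ||beta|| = 1."""
    out = []
    for xi, yi in zip(X, y):
        f = sum(b * v for b, v in zip(beta, xi)) + beta0
        out.append(max(0.0, 1.0 - yi * f / M))
    return out

# Hypothetical one-dimensional data; the classes overlap.
X = [(3.0,), (0.5,), (-1.0,), (-2.0,), (1.0,)]
y = [1, 1, 1, -1, -1]
beta, beta0, M = (1.0,), 0.0, 1.0

xi = slacks(X, y, beta, beta0, M)
# A point is misclassified exactly when y_i f(x_i) < 0, i.e. when xi_i > 1.
n_misclassified = sum(1 for s in xi if s > 1.0)
print(xi, sum(xi), n_misclassified)
```

Each misclassified point contributes more than 1 to the sum, so the count can never exceed $\sum \xi_i$.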

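As in (4.48) in Section 4.5.2, we can drop the norm constraint on $\beta$,
define $M = 1/\|\beta\|$, and write (12.4) in the equivalent form

$$\min \|\beta\| \quad \text{subject to} \quad
\begin{cases}
y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \ \ \forall i, \\
\xi_i \ge 0, \ \sum \xi_i \le \text{constant}.
\end{cases} \tag{12.7}$$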


This is the usual way the support vector classifier is defined for the
nonseparable case. However, we find the presence of the fixed scale
“1” in the constraint $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$ confusing, and prefer to start with (12.6).
The right panel of Figure 12.1 illustrates this overlapping case.

By the nature of the criterion (12.7), we see that points well inside their
class boundary do not play a big role in shaping the boundary. This seems
like an attractive property, and one that differentiates it from linear
discriminant analysis (Section 4.3). In LDA, the decision boundary is
determined by the covariance of the class distributions and the positions of the
class centroids. We will see in Section 12.3.3 that logistic regression is more
similar to the support vector classifier in this regard.
12.2.1 Computing the Support Vector Classifier

The problem (12.7) is quadratic with linear inequality constraints, hence it 
is a convex optimization problem. We describe a quadratic programming 
solution using Lagrange multipliers. Computationally it is convenient to 
re-express (12.7) in the equivalent form 



$$\min_{\beta,\, \beta_0} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0, \ y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \ \forall i, \tag{12.8}$$

where the “cost” parameter $C$ replaces the constant in (12.7); the separable
case corresponds to $C = \infty$.

The Lagrange (primal) function is 

$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] - \sum_{i=1}^N \mu_i \xi_i, \tag{12.9}$$

which we minimize w.r.t. $\beta$, $\beta_0$ and $\xi_i$. Setting the respective derivatives to
zero, we get

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{12.10}$$

$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{12.11}$$

$$\alpha_i = C - \mu_i, \ \forall i, \tag{12.12}$$

as well as the positivity constraints $\alpha_i, \mu_i, \xi_i \ge 0 \ \forall i$. By substituting
(12.10)-(12.12) into (12.9), we obtain the Lagrangian (Wolfe) dual
objective function

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{i'=1}^N \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}, \tag{12.13}$$

which gives a lower bound on the objective function (12.8) for any feasible
point. We maximize $L_D$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^N \alpha_i y_i = 0$. In
addition to (12.10)-(12.12), the Karush-Kuhn-Tucker conditions include
the constraints

$$\alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] = 0, \tag{12.14}$$

$$\mu_i \xi_i = 0, \tag{12.15}$$

$$y_i(x_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0, \tag{12.16}$$

for $i = 1, \ldots, N$. Together these equations (12.10)-(12.16) uniquely
characterize the solution to the primal and dual problem.
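The substitution that leads from the primal function to the dual objective can be spelled out term by term. The following is our own expansion, using only the stationarity conditions (12.10)-(12.12); it is meant as a reading aid rather than a full proof.

$$\begin{aligned}
\tfrac{1}{2}\|\beta\|^2 &= \tfrac{1}{2}\sum_{i}\sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'} && \text{by (12.10)}, \\
\sum_i \alpha_i y_i x_i^T \beta &= \sum_{i}\sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'} && \text{by (12.10)}, \\
\beta_0 \sum_i \alpha_i y_i &= 0 && \text{by (12.11)}, \\
\sum_i (C - \alpha_i - \mu_i)\,\xi_i &= 0 && \text{by (12.12)}.
\end{aligned}$$

Expanding the primal function gives the first line, minus the second and third lines, plus the fourth line, plus $\sum_i \alpha_i$. What survives is $\sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}$, which is the dual objective.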
From (12.10) we see that the solution for $\beta$ has the form

$$\hat\beta = \sum_{i=1}^N \hat\alpha_i y_i x_i, \tag{12.17}$$

with nonzero coefficients $\hat\alpha_i$ only for those observations $i$ for which the
constraints in (12.16) are exactly met (due to (12.14)). These observations
are called the support vectors, since $\hat\beta$ is represented in terms of them
alone. Among these support points, some will lie on the edge of the margin
($\hat\xi_i = 0$), and hence from (12.15) and (12.12) will be characterized by
$0 < \hat\alpha_i < C$; the remainder ($\hat\xi_i > 0$) have $\hat\alpha_i = C$. From (12.14) we can
see that any of these margin points ($0 < \hat\alpha_i$, $\hat\xi_i = 0$) can be used to solve
for $\beta_0$, and we typically use an average of all the solutions for numerical
stability.

Maximizing the dual (12.13) is a simpler convex quadratic programming
problem than the primal (12.9), and can be solved with standard techniques
(Murray et al., 1981, for example).

Given the solutions $\hat\beta_0$ and $\hat\beta$, the decision function can be written as

$$\hat G(x) = \mathrm{sign}[\hat f(x)] = \mathrm{sign}[x^T\hat\beta + \hat\beta_0]. \tag{12.18}$$

The tuning parameter of this procedure is the cost parameter C. 
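To make the step from dual to primal concrete, here is a minimal sketch that recovers $\hat\beta$ and $\hat\beta_0$ from a given dual solution and then classifies new points. The two-point data set is hypothetical; for it the dual can be maximized by hand (the constraint $\sum \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and $2a - 2a^2$ peaks at $a = 1/2$). The function names are ours, and no quadratic programming solver is involved.

```python
def primal_from_dual(X, y, alpha, C, tol=1e-9):
    """Recover (beta, beta0) from dual coefficients alpha.

    beta follows (12.17). beta0 is averaged over margin points, those with
    0 < alpha_i < C, where y_i f(x_i) = 1 and hence beta0 = y_i - x_i^T beta.
    """
    p = len(X[0])
    beta = [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(p)]
    candidates = [
        yi - sum(b * v for b, v in zip(beta, xi))
        for a, yi, xi in zip(alpha, y, X)
        if tol < a < C - tol
    ]
    beta0 = sum(candidates) / len(candidates)
    return beta, beta0

def classify(x, beta, beta0):
    """Decision rule sign[x^T beta + beta0], with ties sent to +1."""
    f = sum(b * v for b, v in zip(beta, x)) + beta0
    return 1 if f >= 0 else -1

# Hypothetical data: one point per class, dual solution found by hand.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]

beta, beta0 = primal_from_dual(X, y, alpha, C=10.0)
print(beta, beta0, classify((2.0, 5.0), beta, beta0))
```

Averaging over all margin points, rather than using a single one, mirrors the numerical-stability remark in the text.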

12.2.2 Mixture Example (Continued) 

Figure 12.2 shows the support vector boundary for the mixture example 
of Figure 2.5 on page 21, with two overlapping classes, for two different 
values of the cost parameter C. The classifiers are rather similar in their 
performance. Points on the wrong side of the boundary are support vectors. 
In addition, points on the correct side of the boundary but close to it (in 
the margin) are also support vectors. The margin is larger for C = 0.01
than it is for C = 10,000. Hence larger values of C focus attention more
on (correctly classified) points near the decision boundary, while smaller 
values involve data further away. Either way, misclassified points are given 
weight, no matter how far away. In this example the procedure is not very 
sensitive to choices of C, because of the rigidity of a linear boundary. 

The optimal value for C can be estimated by cross-validation, as
discussed in Chapter 7. Interestingly, the leave-one-out cross-validation error
can be bounded above by the proportion of support points in the data. The 
reason is that leaving out an observation that is not a support vector will 
not change the solution. Hence these observations, being classified correctly 
by the original boundary, will be classified correctly in the cross-validation 
process. However this bound tends to be too high, and not generally useful 
for choosing C (62% and 85%, respectively, in our examples). 
[Figure 12.2; panel title: C = 10000]

FIGURE 12.2. The linear support vector boundary for the mixture data example
with two overlapping classes, for two different values of C. The broken lines
indicate the margins, where $f(x) = \pm 1$. The support points ($\alpha_i > 0$) are all the
points on the wrong side of their margin. The black solid dots are those support
points falling exactly on the margin ($\xi_i = 0$, $\alpha_i > 0$). In the upper panel 62% of
the observations are support points, while in the lower panel 85% are. The broken
purple curve in the background is the Bayes decision boundary.
12.3 Support Vector Machines and Kernels

The support vector classifier described so far finds linear boundaries in the
input feature space. As with other linear methods, we can make the
procedure more flexible by enlarging the feature space using basis expansions
such as polynomials or splines (Chapter 5). Generally linear boundaries
in the enlarged space achieve better training-class separation, and
translate to nonlinear boundaries in the original space. Once the basis functions
$h_m(x)$, $m = 1, \ldots, M$ are selected, the procedure is the same as before. We
fit the SV classifier using input features $h(x_i) = (h_1(x_i), h_2(x_i), \ldots, h_M(x_i))$,
$i = 1, \ldots, N$, and produce the (nonlinear) function $\hat f(x) = h(x)^T\hat\beta + \hat\beta_0$.
The classifier is $\hat G(x) = \mathrm{sign}(\hat f(x))$ as before.

The support vector machine classifier is an extension of this idea, where 
the dimension of the enlarged space is allowed to get very large, infinite 
in some cases. It might seem that the computations would become
prohibitive. It would also seem that with sufficient basis functions, the data
would be separable, and overfitting would occur. We first show how the 
SVM technology deals with these issues. We then see that in fact the SVM 
classifier is solving a function-fitting problem using a particular criterion 
and form of regularization, and is part of a much bigger class of problems 
that includes the smoothing splines of Chapter 5. The reader may wish 
to consult Section 5.8, which provides background material and overlaps 
somewhat with the next two sections. 

12.3.1 Computing the SVM for Classification 

We can represent the optimization problem (12.9) and its solution in a 
special way that only involves the input features via inner products. We do 
this directly for the transformed feature vectors $h(x_i)$. We then see that for
particular choices of h, these inner products can be computed very cheaply. 

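The Lagrange dual function (12.13) has the form

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{i'=1}^N \alpha_i \alpha_{i'} y_i y_{i'} \langle h(x_i), h(x_{i'}) \rangle. \tag{12.19}$$

From (12.10) we see that the solution function $f(x)$ can be written

$$f(x) = h(x)^T\beta + \beta_0 = \sum_{i=1}^N \alpha_i y_i \langle h(x), h(x_i) \rangle + \beta_0. \tag{12.20}$$

As before, given $\alpha_i$, $\beta_0$ can be determined by solving $y_i f(x_i) = 1$ in (12.20)
for any (or all) $x_i$ for which $0 < \alpha_i < C$.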
So both (12.19) and (12.20) involve $h(x)$ only through inner products. In
fact, we need not specify the transformation $h(x)$ at all, but require only
knowledge of the kernel function

$$K(x, x') = \langle h(x), h(x') \rangle \tag{12.21}$$

that computes inner products in the transformed space. $K$ should be a
symmetric positive (semi-) definite function; see Section 5.8.1.

Three popular choices for $K$ in the SVM literature are

$$\begin{aligned}
d\text{th-Degree polynomial:} \quad & K(x, x') = (1 + \langle x, x' \rangle)^d, \\
\text{Radial basis:} \quad & K(x, x') = \exp(-\gamma \|x - x'\|^2), \\
\text{Neural network:} \quad & K(x, x') = \tanh(\kappa_1 \langle x, x' \rangle + \kappa_2).
\end{aligned} \tag{12.22}$$

Consider for example a feature space with two inputs $X_1$ and $X_2$, and a
polynomial kernel of degree 2. Then

$$\begin{aligned}
K(X, X') &= (1 + \langle X, X' \rangle)^2 \\
&= (1 + X_1 X_1' + X_2 X_2')^2 \\
&= 1 + 2X_1 X_1' + 2X_2 X_2' + (X_1 X_1')^2 + (X_2 X_2')^2 + 2X_1 X_1' X_2 X_2'.
\end{aligned} \tag{12.23}$$

Then $M = 6$, and if we choose $h_1(X) = 1$, $h_2(X) = \sqrt{2}X_1$, $h_3(X) = \sqrt{2}X_2$,
$h_4(X) = X_1^2$, $h_5(X) = X_2^2$, and $h_6(X) = \sqrt{2}X_1 X_2$, then $K(X, X') =
\langle h(X), h(X') \rangle$. From (12.20) we see that the solution can be written

$$\hat f(x) = \sum_{i=1}^N \hat\alpha_i y_i K(x, x_i) + \hat\beta_0. \tag{12.24}$$
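The identity between the degree-2 polynomial kernel and the inner product of the six basis functions can be verified numerically. The sketch below does this for one arbitrary pair of points; the function names and the test points are our own choices.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (1 + <x, z>)^2 for two-dimensional inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    """Explicit six-dimensional feature map for the degree-2 polynomial kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

# Arbitrary pair of points, chosen only for illustration.
x, z = (0.5, -1.5), (2.0, 0.25)
lhs = poly2_kernel(x, z)      # one multiplication-heavy line, no feature map
rhs = inner(h(x), h(z))       # explicit expansion into M = 6 features
print(lhs, rhs)
```

The kernel evaluation never forms the six features, which is the computational point: the cost depends on the input dimension, not on $M$.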

The role of the parameter $C$ is clearer in an enlarged feature space,
since perfect separation is often achievable there. A large value of $C$ will
discourage any positive $\xi_i$, and lead to an overfit wiggly boundary in the
original feature space; a small value of $C$ will encourage a small value of
$\|\beta\|$, which in turn causes $f(x)$ and hence the boundary to be smoother.
Figure 12.3 shows two nonlinear support vector machines applied to the
mixture example of Chapter 2. The regularization parameter was chosen
in both cases to achieve good test error. The radial basis kernel produces
a boundary quite similar to the Bayes optimal boundary for this example;
compare Figure 2.5.

In the early literature on support vectors, there were claims that the 
kernel property of the support vector machine is unique to it and allows 
one to finesse the curse of dimensionality. Neither of these claims is true, 
and we go into both of these issues in the next three subsections. 
[Figure 12.3; panel title: SVM - Degree-4 Polynomial in Feature Space]

FIGURE 12.3. Two nonlinear SVMs for the mixture data. The upper plot uses
a 4th degree polynomial kernel, the lower a radial basis kernel (with $\gamma = 1$). In
each case C was tuned to approximately achieve the best test error performance,
and C = 1 worked well in both cases. The radial basis kernel performs the best
(close to Bayes optimal), as might be expected given the data arise from mixtures
of Gaussians. The broken purple curve in the background is the Bayes decision
boundary.
FIGURE 12.4. The support vector loss function (hinge loss), compared to the
negative log-likelihood loss (binomial deviance) for logistic regression,
squared-error loss, and a “Huberized” version of the squared hinge loss. All are shown as a
function of $yf$ rather than $f$, because of the symmetry between the $y = +1$ and
$y = -1$ case. The deviance and Huber have the same asymptotes as the SVM
loss, but are rounded in the interior. All are scaled to have the limiting left-tail
slope of −1.


12.3.2 The SVM as a Penalization Method 

With $f(x) = h(x)^T\beta + \beta_0$, consider the optimization problem

$$\min_{\beta_0,\, \beta} \sum_{i=1}^N [1 - y_i f(x_i)]_+ + \frac{\lambda}{2}\|\beta\|^2 \tag{12.25}$$

where the subscript “+” indicates positive part. This has the form loss +
penalty, which is a familiar paradigm in function estimation. It is easy to
show (Exercise 12.1) that the solution to (12.25), with $\lambda = 1/C$, is the
same as that for (12.8).
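The loss + penalty form can be minimized directly, without Lagrange multipliers. The sketch below uses plain full-batch subgradient descent on the hinge-plus-ridge criterion for a linear $f(x) = x^T\beta + \beta_0$. This is a deliberately simple substitute for the quadratic programming approach described earlier, not the method used in the text; the step size, iteration count, $\lambda$, and toy data are all assumptions.

```python
def fit_linear_svm(X, y, lam=0.1, step=0.01, n_iter=1000):
    """Subgradient descent on sum_i [1 - y_i f(x_i)]_+ + (lam / 2) ||beta||^2."""
    p = len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(n_iter):
        g = [lam * b for b in beta]   # gradient of the penalty term
        g0 = 0.0                      # the intercept is not penalized
        for xi, yi in zip(X, y):
            f = sum(b * v for b, v in zip(beta, xi)) + beta0
            if yi * f < 1.0:          # hinge term is active for this point
                for j in range(p):
                    g[j] -= yi * xi[j]
                g0 -= yi
        beta = [b - step * gj for b, gj in zip(beta, g)]
        beta0 -= step * g0
    return beta, beta0

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]

beta, beta0 = fit_linear_svm(X, y)
preds = [1 if sum(b * v for b, v in zip(beta, xi)) + beta0 >= 0 else -1 for xi in X]
print(beta, beta0, preds)
```

With a fixed step size the iterates hover near the minimizer rather than converging exactly, so this is suitable for illustration only.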

Examination of the “hinge” loss function $L(y, f) = [1 - yf]_+$ shows that
it is reasonable for two-class classification, when compared to other more
traditional loss functions. Figure 12.4 compares it to the log-likelihood loss
for logistic regression, as well as squared-error loss and a variant thereof.
The (negative) log-likelihood or binomial deviance has tails similar to those of the
SVM loss, giving zero penalty to points well inside their margin, and a
TABLE 12.1. The population minimizers for the different loss functions in
Figure 12.4. Logistic regression uses the binomial log-likelihood or deviance. Linear
discriminant analysis (Exercise 4.2) uses squared-error loss. The SVM hinge loss
estimates the mode of the posterior class probabilities, whereas the others estimate
a linear transformation of these probabilities.

Loss Function          L[y, f(x)]                              Minimizing Function
---------------------  --------------------------------------  ----------------------------------------------
Binomial Deviance      $\log[1 + e^{-y f(x)}]$                 $f(x) = \log \frac{\Pr(Y = +1|x)}{\Pr(Y = -1|x)}$
SVM Hinge Loss         $[1 - y f(x)]_+$                        $f(x) = \mathrm{sign}[\Pr(Y = +1|x) - \frac{1}{2}]$
Squared Error          $[y - f(x)]^2 = [1 - y f(x)]^2$         $f(x) = 2\Pr(Y = +1|x) - 1$
“Huberised” Square     $-4 y f(x)$ if $y f(x) < -1$;           $f(x) = 2\Pr(Y = +1|x) - 1$
Hinge Loss             $[1 - y f(x)]_+^2$ otherwise

linear penalty to points on the wrong side and far away. Squared-error, on
the other hand, gives a quadratic penalty, and points well inside their own
margin have a strong influence on the model as well. The squared hinge
loss $L(y, f) = [1 - yf]_+^2$ is like the quadratic, except it is zero for points
inside their margin. It still rises quadratically in the left tail, and will be
less robust than hinge or deviance to misclassified observations. Recently
Rosset and Zhu (2007) proposed a “Huberized” version of the squared hinge
loss, which converts smoothly to a linear loss at $yf = -1$.
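The loss functions discussed here are easy to write down as functions of the margin $yf$. The sketch below uses the unscaled forms (the figure rescales them to a common left-tail slope, which we do not reproduce), and the function names are our own.

```python
import math

def hinge(yf):
    """SVM hinge loss [1 - yf]_+."""
    return max(0.0, 1.0 - yf)

def binomial_deviance(yf):
    """Negative binomial log-likelihood log(1 + exp(-yf))."""
    return math.log(1.0 + math.exp(-yf))

def squared_error(yf):
    """Squared error (1 - yf)^2; also penalizes points with yf > 1."""
    return (1.0 - yf) ** 2

def squared_hinge(yf):
    """Squared hinge [1 - yf]_+^2; zero inside the margin, quadratic outside."""
    return max(0.0, 1.0 - yf) ** 2

def huberized_squared_hinge(yf):
    """Squared hinge that switches to the linear piece -4 yf below yf = -1."""
    return -4.0 * yf if yf < -1.0 else max(0.0, 1.0 - yf) ** 2

for yf in (-3.0, -1.0, 0.0, 1.0, 2.0):
    print(yf, hinge(yf), binomial_deviance(yf), squared_error(yf),
          squared_hinge(yf), huberized_squared_hinge(yf))
```

At $yf = 2$ the squared-error loss is still positive while the hinge-type losses are zero, which is the sense in which squared error lets well-classified points influence the fit.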

We can characterize these loss functions in terms of what they are
estimating at the population level. We consider minimizing $\mathrm{E}\,L(Y, f(X))$.
Table 12.1 summarizes the results. Whereas the hinge loss estimates the
classifier $G(x)$ itself, all the others estimate a transformation of the class
posterior probabilities. The “Huberized” square hinge loss shares attractive
properties of logistic regression (smooth loss function, estimates
probabilities), as well as the SVM hinge loss (support points).

Formulation (12.25) casts the SVM as a regularized function estimation
problem, where the coefficients of the linear expansion $f(x) = \beta_0 + h(x)^T\beta$
are shrunk toward zero (excluding the constant). If $h(x)$ represents a
hierarchical basis having some ordered structure (such as ordered in roughness),
then the uniform shrinkage makes more sense if the rougher elements $h_j$ in
the vector $h$ have smaller norm.

All the loss-functions in Table 12.1 except squared-error are so-called
“margin maximizing loss-functions” (Rosset et al., 2004b). This means that
if the data are separable, then the limit of $\hat\beta_\lambda$ in (12.25) as $\lambda \to 0$ defines
the optimal separating hyperplane.¹

12.3.3 Function Estimation and Reproducing Kernels 

Here we describe SVMs in terms of function estimation in reproducing 
kernel Hilbert spaces, where the kernel property abounds. This material is 
discussed in some detail in Section 5.8. This provides another view of the 
support vector classifier, and helps to clarify how it works. 

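Suppose the basis $h$ arises from the (possibly finite) eigen-expansion of
a positive definite kernel $K$,

$$K(x, x') = \sum_{m=1}^{\infty} \phi_m(x)\phi_m(x')\delta_m \tag{12.26}$$

and $h_m(x) = \sqrt{\delta_m}\,\phi_m(x)$. Then with $\theta_m = \sqrt{\delta_m}\,\beta_m$, we can write (12.25)
as

$$\min_{\beta_0,\, \theta} \sum_{i=1}^N \Big[1 - y_i\Big(\beta_0 + \sum_{m=1}^{\infty} \theta_m \phi_m(x_i)\Big)\Big]_+ + \frac{\lambda}{2}\sum_{m=1}^{\infty} \frac{\theta_m^2}{\delta_m}. \tag{12.27}$$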

Now (12.27) is identical in form to (5.49) on page 169 in Section 5.8, and 
the theory of reproducing kernel Hilbert spaces described there guarantees 
a finite-dimensional solution of the form 


$$f(x) = \beta_0 + \sum_{i=1}^N \alpha_i K(x, x_i). \tag{12.28}$$


In particular we see there an equivalent version of the optimization
criterion (12.19) [Equation (5.67) in Section 5.8.2; see also Wahba et al. (2000)],

$$\min_{\beta_0,\, \alpha} \sum_{i=1}^N (1 - y_i f(x_i))_+ + \frac{\lambda}{2}\alpha^T \mathbf{K} \alpha, \tag{12.29}$$

where $\mathbf{K}$ is the $N \times N$ matrix of kernel evaluations for all pairs of training
features (Exercise 12.2).
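The pieces of this finite-dimensional representation are simple to compute. The following sketch builds the kernel matrix for a radial basis kernel, evaluates $f(x) = \beta_0 + \sum_i \alpha_i K(x, x_i)$, and computes the quadratic form $\alpha^T \mathbf{K} \alpha$. The coefficients are arbitrary placeholders rather than a fitted solution, and all names are our own.

```python
import math

def rbf(x, z, gamma=1.0):
    """Radial basis kernel exp(-gamma ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram(X, kernel):
    """N x N matrix of kernel evaluations for all pairs of training points."""
    return [[kernel(a, b) for b in X] for a in X]

def f_hat(x, X, alpha, beta0, kernel):
    """Kernel expansion beta0 + sum_i alpha_i K(x, x_i)."""
    return beta0 + sum(a * kernel(x, xi) for a, xi in zip(alpha, X))

def quad_form(alpha, K):
    """Penalty term alpha^T K alpha."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))

# Hypothetical training inputs and placeholder coefficients.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
alpha = [0.5, -1.0, 0.25]
beta0 = 0.1

K = gram(X, rbf)
print(f_hat(X[0], X, alpha, beta0, rbf), quad_form(alpha, K))
```

Evaluating $f$ at a training point $x_i$ only needs row $i$ of the kernel matrix, so fitting can work with $\mathbf{K}$ alone and never touch the basis functions.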

These models are quite general, and include, for example, the entire family
of smoothing splines, additive and interaction spline models discussed


¹ For logistic regression with separable data, $\hat\beta_\lambda$ diverges, but $\hat\beta_\lambda/\|\hat\beta_\lambda\|$ converges to
the optimal separating direction.
in Chapters 5 and 9, and in more detail in Wahba (1990) and Hastie and
Tibshirani (1990). They can be expressed more generally as 

$$\min_{f \in \mathcal{H}} \sum_{i=1}^N [1 - y_i f(x_i)]_+ + \lambda J(f), \tag{12.30}$$

where $\mathcal{H}$ is the structured space of functions, and $J(f)$ an appropriate
regularizer on that space. For example, suppose $\mathcal{H}$ is the space of additive
functions $f(x) = \sum_{j=1}^p f_j(x_j)$, and $J(f) = \sum_j \int \{f_j''(x_j)\}^2 dx_j$. Then the
solution to (12.30) is an additive cubic spline, and has a kernel representation
(12.28) with $K(x, x') = \sum_{j=1}^p K_j(x_j, x_j')$. Each of the $K_j$ is the kernel
appropriate for the univariate smoothing spline in $x_j$ (Wahba, 1990).

Conversely this discussion also shows that, for example, any of the kernels 
described in (12.22) above can be used with any convex loss function, and 
will also lead to a finite-dimensional representation of the form (12.28). 
Figure 12.5 uses the same kernel functions as in Figure 12.3, except using 
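the binomial log-likelihood as a loss function.² The fitted function is hence
an estimate of the log-odds,

$$\hat f(x) = \log \frac{\hat{\Pr}(Y = +1|x)}{\hat{\Pr}(Y = -1|x)} = \hat\beta_0 + \sum_{i=1}^N \hat\alpha_i K(x, x_i), \tag{12.31}$$

or conversely we get an estimate of the class probabilities

$$\hat{\Pr}(Y = +1|x) = \frac{1}{1 + e^{-\hat\beta_0 - \sum_{i=1}^N \hat\alpha_i K(x, x_i)}}. \tag{12.32}$$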

The fitted models are quite similar in shape and performance. Examples 
and more details are given in Section 5.8. 

It does happen that for SVMs, a sizable fraction of the $N$ values of $\alpha_i$
can be zero (the nonsupport points). In the two examples in Figure 12.3,
these fractions are 42% and 45%, respectively. This is a consequence of the
piecewise linear nature of the first part of the criterion (12.25). The lower
the class overlap (on the training data), the greater this fraction will be.
Reducing $\lambda$ will generally reduce the overlap (allowing a more flexible $f$).
A small number of support points means that $\hat f(x)$ can be evaluated more
quickly, which is important at lookup time. Of course, reducing the overlap
too much can lead to poor generalization.


² Ji Zhu assisted in the preparation of these examples.
[Figure 12.5; panel title: LR - Radial Kernel in Feature Space]

FIGURE 12.5. The logistic regression versions of the SVM models in
Figure 12.3, using the identical kernels and hence penalties, but the log-likelihood
loss instead of the SVM loss function. The two broken contours correspond to
posterior probabilities of 0.75 and 0.25 for the +1 class (or vice versa). The
broken purple curve in the background is the Bayes decision boundary.
TABLE 12.2. Skin of the orange: Shown are mean (standard error of the mean)
of the test error over 50 simulations. BRUTO fits an additive spline model
adaptively, while MARS fits a low-order interaction model adaptively.

                        Test Error (SE)
     Method             No Noise Features    Six Noise Features
---  -----------------  -------------------  --------------------
1    SV Classifier      0.450 (0.003)        0.472 (0.003)
2    SVM/poly 2         0.078 (0.003)        0.152 (0.004)
3    SVM/poly 5         0.180 (0.004)        0.370 (0.004)
4    SVM/poly 10        0.230 (0.003)        0.434 (0.002)
5    BRUTO              0.084 (0.003)        0.090 (0.003)
6    MARS               0.156 (0.004)        0.173 (0.005)
     Bayes              0.029                0.029

12.3.4 SVMs and the Curse of Dimensionality 

In this section, we address the question of whether SVMs have some edge 
on the curse of dimensionality. Notice that in expression (12.23) we are not 
allowed a fully general inner product in the space of powers and products. 
For example, all terms of the form $2X_j X_j'$ are given equal weight, and the
kernel cannot adapt itself to concentrate on subspaces. If the number of
features $p$ were large, but the class separation occurred only in the linear
subspace spanned by say $X_1$ and $X_2$, this kernel would not easily find the
structure and would suffer from having many dimensions to search over. 
One would have to build knowledge about the subspace into the kernel; 
that is, tell it to ignore all but the first two inputs. If such knowledge were 
available a priori, much of statistical learning would be made much easier. 
A major goal of adaptive methods is to discover such structure. 

We support these statements with an illustrative example. We generated
100 observations in each of two classes. The first class has four standard
normal independent features $X_1, X_2, X_3, X_4$. The second class also has four
standard normal independent features, but conditioned on $9 \le \sum_j X_j^2 \le 16$.
This is a relatively easy problem. As a second harder problem, we
augmented the features with an additional six standard Gaussian noise
features. Hence the second class almost completely surrounds the first, like the
skin surrounding the orange, in a four-dimensional subspace. The Bayes
error rate for this problem is 0.029 (irrespective of dimension). We generated
1000 test observations to compare different procedures. The average test
errors over 50 simulations, with and without noise features, are shown in
Table 12.2.
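A data set of this kind can be simulated with rejection sampling. The sketch below is one possible way to do it, assuming that the conditioning applies to the squared norm of the first four features and that any noise features are unconditioned standard normals. The seed, function names, and sampling method are our own choices, not a description of how the reported simulations were run.

```python
import random

def sample_class1(n, p, rng):
    """n observations with p independent standard normal features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

def sample_class2(n, p, rng, n_signal=4, lo=9.0, hi=16.0):
    """Standard normal features, accepted only if the squared norm of the
    first n_signal features lies in [lo, hi] (the 'skin' of the orange)."""
    out = []
    while len(out) < n:
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        r2 = sum(v * v for v in x[:n_signal])
        if lo <= r2 <= hi:
            out.append(x)
    return out

rng = random.Random(0)          # fixed seed so the sketch is reproducible
p = 10                          # four signal features plus six noise features
X1 = sample_class1(100, p, rng)
X2 = sample_class2(100, p, rng)
X = X1 + X2
y = [-1] * len(X1) + [1] * len(X2)
print(len(X), len(X[0]))
```

Setting `p = 4` gives the easier version of the problem without noise features.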

Line 1 uses the support vector classifier in the original feature space. 
Lines 2-4 refer to the support vector machine with a 2nd-, 5th- and 10th-degree
polynomial kernel. For all support vector procedures, we chose the cost
parameter C to minimize the test error, to be as fair as possible to the 
[Figure 12.6; title: Test Error Curves - SVM with Radial Kernel; panels: $\gamma = 5$, $\gamma = 1$, $\gamma = 0.5$, $\gamma = 0.1$; horizontal axis: C]

FIGURE 12.6. Test-error curves as a function of the cost parameter C for the
radial-kernel SVM classifier on the mixture data. At the top of each plot is the
scale parameter $\gamma$ for the radial kernel: $K_\gamma(x, y) = \exp(-\gamma\|x - y\|^2)$. The optimal
value for C depends quite strongly on the scale of the kernel. The Bayes error
rate is indicated by the broken horizontal lines.

method. Line 5 fits an additive spline model to the (—1,+1) response by 
least squares, using the BRUTO algorithm for additive models, described 
in Hastie and Tibshirani (1990). Line 6 uses MARS (multivariate adaptive 
regression splines) allowing interaction of all orders, as described in
Chapter 9; as such it is comparable with the SVM/poly 10. Both BRUTO and
MARS have the ability to ignore redundant variables. Test error was not 
used to choose the smoothing parameters in either of lines 5 or 6. 

In the original feature space, a hyperplane cannot separate the classes, 
and the support vector classifier (line 1) does poorly. The polynomial
support vector machine makes a substantial improvement in test error rate,
but is adversely affected by the six noise features. It is also very sensitive to 
the choice of kernel: the second degree polynomial kernel (line 2) does best, 
since the true decision boundary is a second-degree polynomial. However, 
higher-degree polynomial kernels (lines 3 and 4) do much worse. BRUTO 
performs well, since the boundary is additive. BRUTO and MARS adapt 
well: their performance does not deteriorate much in the presence of noise. 

12.3.5 A Path Algorithm for the SVM Classifier 

The regularization parameter for the SVM classifier is the cost parameter 
C, or its inverse $\lambda$ in (12.25). Common usage is to set C high, leading often
to somewhat overfit classifiers. 

Figure 12.6 shows the test error on the mixture data as a function of
C, using different radial-kernel parameters $\gamma$. When $\gamma = 5$ (narrow peaked
kernels), the heaviest regularization (small C) is called for. With $\gamma = 1$
FIGURE 12.7. A simple example illustrates the SVM path algorithm. (left
panel:) This plot illustrates the state of the model at $\lambda = 0.5$. The “+1”
points are orange, the “−1” blue. $\lambda = 1/2$, and the width of the soft margin
is $2/\|\beta\| = 2 \times 0.587$. Two blue points {3, 5} are misclassified, while the two
orange points {10, 12} are correctly classified, but on the wrong side of their margin
$f(x) = +1$; each of these has $y_i f(x_i) < 1$. The three square shaped points {2, 6, 7}
are exactly on their margins. (right panel:) This plot shows the piecewise linear
profiles $\alpha_i(\lambda)$. The horizontal broken line at $\lambda = 1/2$ indicates the state of the $\alpha_i$
for the model in the left plot.


(the value used in Figure 12.3), an intermediate value of C is required. 
Clearly in situations such as these, we need to determine a good choice 
for C, perhaps by cross-validation. Here we describe a path algorithm (in 
the spirit of Section 3.8) for efficiently fitting the entire sequence of SVM 
models obtained by varying C. 

It is convenient to use the loss+penalty formulation (12.25), along with
Figure 12.4. This leads to a solution for $\beta$ at a given value of $\lambda$:

$$\beta_\lambda = \frac{1}{\lambda}\sum_{i=1}^N \alpha_i y_i x_i. \tag{12.33}$$

The $\alpha_i$ are again Lagrange multipliers, but in this case they all lie in [0, 1].

Figure 12.7 illustrates the setup. It can be shown that the KKT
optimality conditions imply that the labeled points $(x_i, y_i)$ fall into three distinct
groups:

• Observations correctly classified and outside their margins. They have
$y_i f(x_i) > 1$, and Lagrange multipliers $\alpha_i = 0$. Examples are the
orange points 8, 9 and 11, and the blue points 1 and 4.

• Observations sitting on their margins with $y_i f(x_i) = 1$, with Lagrange
multipliers $\alpha_i \in [0, 1]$. Examples are the orange 7 and the blue 2 and 8.

• Observations inside their margins have $y_i f(x_i) < 1$, with $\alpha_i = 1$.
Examples are the blue 3 and 5, and the orange 10 and 12.
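The three groups in the list above can be identified directly from the values $y_i f(x_i)$. The sketch below does this for a linear $f$; the data, the coefficients, and the tolerance used to decide that a point is "on" its margin are hypothetical.

```python
def kkt_groups(X, y, beta, beta0, tol=1e-8):
    """Partition observation indices by the value of y_i f(x_i).

    outside   : y_i f(x_i) > 1, multiplier alpha_i = 0
    on_margin : y_i f(x_i) = 1, multiplier alpha_i in [0, 1]
    inside    : y_i f(x_i) < 1, multiplier alpha_i = 1
    """
    groups = {"outside": [], "on_margin": [], "inside": []}
    for i, (xi, yi) in enumerate(zip(X, y)):
        yf = yi * (sum(b * v for b, v in zip(beta, xi)) + beta0)
        if yf > 1.0 + tol:
            groups["outside"].append(i)
        elif yf < 1.0 - tol:
            groups["inside"].append(i)
        else:
            groups["on_margin"].append(i)
    return groups

# Hypothetical one-dimensional example with f(x) = 2x.
X = [(1.0,), (0.5,), (0.25,), (-0.5,), (0.1,)]
y = [1, 1, 1, -1, -1]
groups = kkt_groups(X, y, beta=(2.0,), beta0=0.0)
print(groups)
```

Note that "inside" includes misclassified points as well as correctly classified points that fall within the margin band.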

The idea for the path algorithm is as follows. Initially $\lambda$ is large, the
margin $1/\|\beta_\lambda\|$ is wide, and all points are inside their margin and
have $\alpha_i = 1$. As $\lambda$ decreases, $1/\|\beta_\lambda\|$ decreases, and
the margin gets narrower. Some points will move from inside their margins to
outside their margins, and their $\alpha_i$ will change from 1 to 0. By
continuity of the $\alpha_i(\lambda)$, these points will linger on the margin
during this transition. From (12.33) we see that the points with $\alpha_i = 1$
make fixed contributions to $\beta(\lambda)$, and those with $\alpha_i = 0$ make
no contribution. So all that changes as $\lambda$ decreases are the
$\alpha_i \in [0,1]$ of those (small number of) points on the margin. Since all
these points have $y_i f(x_i) = 1$, this results in a small set of linear
equations that prescribe how $\alpha_i(\lambda)$ and hence $\beta_\lambda$ change
during these transitions. This results in piecewise linear paths for each of
the $\alpha_i(\lambda)$. The breaks occur when points cross the margin.
Figure 12.7 (right panel) shows the $\alpha_i(\lambda)$ profiles for the small
example in the left panel.
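
The bookkeeping behind the three groups can be sketched in a few lines of
Python. This is only an illustration of the KKT grouping and of the formula
$\beta_\lambda = \frac{1}{\lambda}\sum_i \alpha_i y_i x_i$; it is not the path
algorithm itself, and the function names and the tolerance `tol` are our own
assumptions rather than anything taken from the svmpath package.

```python
def kkt_group(margin, tol=1e-8):
    """Classify a point by its margin value y_i * f(x_i)."""
    if margin > 1 + tol:
        return "outside"    # correctly classified, alpha_i = 0
    if margin < 1 - tol:
        return "inside"     # inside the margin, alpha_i = 1
    return "on_margin"      # alpha_i may take any value in [0, 1]


def beta_from_alphas(alphas, ys, xs, lam):
    """beta_lambda = (1 / lambda) * sum_i alpha_i * y_i * x_i."""
    p = len(xs[0])
    beta = [0.0] * p
    for a, y, x in zip(alphas, ys, xs):
        for j in range(p):
            beta[j] += a * y * x[j] / lam
    return beta


# Toy values (hypothetical): one point inside, one outside, one on the margin.
beta = beta_from_alphas([1.0, 0.0, 0.5], [1, -1, 1],
                        [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]], lam=2.0)
```

Only the point with $\alpha_i$ strictly between 0 and 1 would have its
contribution change as $\lambda$ moves, which is why the path is cheap to update.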

Although we have described this for linear SVMs, exactly the same idea 
works for nonlinear models, in which (12.33) is replaced by

$$f_\lambda(x) = \frac{1}{\lambda} \sum_{i=1}^N \alpha_i y_i K(x, x_i). \tag{12.34}$$


Details can be found in Hastie et al. (2004). An R package svmpath is 
available on CRAN for fitting these models. 

12.3.6 Support Vector Machines for Regression 

In this section we show how SVMs can be adapted for regression with a 
quantitative response, in ways that inherit some of the properties of the 
SVM classifier. We first discuss the linear regression model 


$$f(x) = x^T \beta + \beta_0, \tag{12.35}$$


and then handle nonlinear generalizations. To estimate $\beta$, we consider
minimization of

$$H(\beta, \beta_0) = \sum_{i=1}^N V(y_i - f(x_i)) + \frac{\lambda}{2} \|\beta\|^2, \tag{12.36}$$

where

$$V_\epsilon(r) = \begin{cases} 0 & \text{if } |r| < \epsilon, \\ |r| - \epsilon & \text{otherwise.} \end{cases} \tag{12.37}$$

FIGURE 12.8. The left panel shows the $\epsilon$-insensitive error function used
by the support vector regression machine. The right panel shows the error
function used in Huber’s robust regression (blue curve). Beyond $|c|$, the
function changes from quadratic to linear.

This is an “$\epsilon$-insensitive” error measure, ignoring errors of size less
than $\epsilon$ (left panel of Figure 12.8). There is a rough analogy with the
support vector classification setup, where points on the correct side of the
decision boundary and far away from it are ignored in the optimization. In
regression, these “low error” points are the ones with small residuals.

It is interesting to contrast this with error measures used in robust
regression in statistics. The most popular, due to Huber (1964), has the form

$$V_H(r) = \begin{cases} r^2/2 & \text{if } |r| \le c, \\ c|r| - c^2/2 & \text{if } |r| > c, \end{cases} \tag{12.38}$$

shown in the right panel of Figure 12.8. This function reduces from quadratic
to linear the contributions of observations with absolute residual greater
than a prechosen constant $c$. This makes the fitting less sensitive to
outliers. The support vector error measure (12.37) also has linear tails
(beyond $\epsilon$), but in addition it flattens the contributions of those
cases with small residuals.
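
To make the contrast concrete, here is a minimal Python sketch of the two error
measures, written from their verbal description above (zero inside a band of
half-width $\epsilon$ with linear tails, versus quadratic up to $c$ and linear
beyond). The function names are ours.

```python
def eps_insensitive(r, eps):
    """Support vector error measure: ignore residuals smaller than eps."""
    return max(0.0, abs(r) - eps)


def huber(r, c):
    """Huber's error measure: quadratic for |r| <= c, linear beyond c."""
    a = abs(r)
    if a <= c:
        return 0.5 * r * r
    return c * a - 0.5 * c * c


# Evaluate both measures on a small grid of residuals.
residuals = [-3.0, -1.0, -0.05, 0.0, 0.05, 1.0, 3.0]
svr_losses = [eps_insensitive(r, eps=0.1) for r in residuals]
huber_losses = [huber(r, c=1.345) for r in residuals]
```

Both functions grow linearly for large residuals, but only the
$\epsilon$-insensitive measure is exactly zero for the small ones.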

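If $\hat\beta$, $\hat\beta_0$ are the minimizers of $H$, the solution function
can be shown to have the form

$$\hat\beta = \sum_{i=1}^N (\hat\alpha_i^* - \hat\alpha_i) x_i, \tag{12.39}$$

$$\hat f(x) = \sum_{i=1}^N (\hat\alpha_i^* - \hat\alpha_i) \langle x, x_i \rangle + \beta_0, \tag{12.40}$$

where $\hat\alpha_i$, $\hat\alpha_i^*$ are positive and solve the quadratic
programming problem

$$\min_{\alpha_i, \alpha_i^*} \; \epsilon \sum_{i=1}^N (\alpha_i^* + \alpha_i) - \sum_{i=1}^N y_i (\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i,i'=1}^N (\alpha_i^* - \alpha_i)(\alpha_{i'}^* - \alpha_{i'}) \langle x_i, x_{i'} \rangle$$

subject to the constraints

$$\begin{aligned} 0 \le \alpha_i,\ \alpha_i^* &\le 1/\lambda, \\ \sum_{i=1}^N (\alpha_i^* - \alpha_i) &= 0, \\ \alpha_i \alpha_i^* &= 0. \end{aligned} \tag{12.41}$$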

Due to the nature of these constraints, typically only a subset of the solution
values $(\hat\alpha_i^* - \hat\alpha_i)$ are nonzero, and the associated data
values are called the support vectors. As was the case in the classification
setting, the solution depends on the input values only through the inner
products $\langle x_i, x_{i'} \rangle$. Thus we can generalize the methods to
richer spaces by defining an appropriate inner product, for example, one of
those defined in (12.22).

Note that there are parameters, $\epsilon$ and $\lambda$, associated with the
criterion (12.36). These seem to play different roles. $\epsilon$ is a parameter
of the loss function $V_\epsilon$, just like $c$ is for $V_H$. Note that both
$V_\epsilon$ and $V_H$ depend on the scale of $y$ and hence $r$. If we scale our
response (and hence use $V_H(r/\sigma)$ and $V_\epsilon(r/\sigma)$ instead), then
we might consider using preset values for $c$ and $\epsilon$ (the value
$c = 1.345$ achieves 95% efficiency for the Gaussian). The quantity $\lambda$
is a more traditional regularization parameter, and can be estimated for
example by cross-validation.

12.3.7 Regression and Kernels 

As discussed in Section 12.3.3, this kernel property is not unique to
support vector machines. Suppose we consider approximation of the regression
function in terms of a set of basis functions $\{h_m(x)\}$, $m = 1, 2, \ldots, M$:

$$f(x) = \sum_{m=1}^M \beta_m h_m(x) + \beta_0. \tag{12.42}$$

To estimate $\beta$ and $\beta_0$ we minimize

$$H(\beta, \beta_0) = \sum_{i=1}^N V(y_i - f(x_i)) + \frac{\lambda}{2} \sum_{m=1}^M \beta_m^2 \tag{12.43}$$

for some general error measure $V(r)$. For any choice of $V(r)$, the solution
$\hat f(x) = \sum \hat\beta_m h_m(x) + \hat\beta_0$ has the form

$$\hat f(x) = \sum_{i=1}^N \hat\alpha_i K(x, x_i) \tag{12.44}$$

with $K(x, y) = \sum_{m=1}^M h_m(x) h_m(y)$. Notice that this has the same form
as both the radial basis function expansion and a regularization estimate,
discussed in Chapters 5 and 6.

For concreteness, let’s work out the case $V(r) = r^2$. Let $H$ be the
$N \times M$ basis matrix with $im$th element $h_m(x_i)$, and suppose that
$M > N$ is large. For simplicity we assume that $\beta_0 = 0$, or that the
constant is absorbed in $h$; see Exercise 12.3 for an alternative.

We estimate $\beta$ by minimizing the penalized least squares criterion

$$H(\beta) = (y - H\beta)^T (y - H\beta) + \lambda \|\beta\|^2. \tag{12.45}$$

The solution is

$$\hat y = H \hat\beta \tag{12.46}$$

with $\hat\beta$ determined by

$$-H^T (y - H\hat\beta) + \lambda \hat\beta = 0. \tag{12.47}$$

From this it appears that we need to evaluate the $M \times M$ matrix of inner
products in the transformed space. However, we can premultiply by $H$ to
give

$$H\hat\beta = (HH^T + \lambda I)^{-1} HH^T y. \tag{12.48}$$

The $N \times N$ matrix $HH^T$ consists of inner products between pairs of
observations $i, i'$; that is, the evaluation of an inner product kernel
$\{HH^T\}_{i,i'} = K(x_i, x_{i'})$. It is easy to show (12.44) directly in this
case, that the predicted values at an arbitrary $x$ satisfy

$$\hat f(x) = h(x)^T \hat\beta = \sum_{i=1}^N \hat\alpha_i K(x, x_i), \tag{12.49}$$

where $\hat\alpha = (HH^T + \lambda I)^{-1} y$. As in the support vector machine,
we need not specify or evaluate the large set of functions
$h_1(x), h_2(x), \ldots, h_M(x)$. Only the inner product kernel
$K(x_i, x_{i'})$ need be evaluated, at the $N$ training points for each $i, i'$
and at points $x$ for predictions there. Careful choice of $h_m$ (such as the
eigenfunctions of particular, easy-to-evaluate kernels $K$) means, for example,
that $HH^T$ can be computed at a cost of $N^2/2$ evaluations of $K$, rather
than the direct cost $N^2 M$.
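
The equivalence between the $M$-dimensional penalized least squares solution
and the $N$-dimensional kernel form can be checked numerically. The sketch
below assumes NumPy is available and uses a random basis matrix purely for
illustration; the sizes, the seed and the variable names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 20, 0.5              # M > N, as in the text
H = rng.normal(size=(N, M))         # basis matrix, H[i, m] = h_m(x_i)
y = rng.normal(size=N)

# Primal form: solve the M x M system (H^T H + lam I) beta = H^T y.
beta = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

# Kernel form: alpha = (H H^T + lam I)^{-1} y needs only the N x N matrix K.
K = H @ H.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)

fits_primal = H @ beta              # H beta
fits_kernel = K @ alpha             # sum_i alpha_i K(x_j, x_i) at training points

# Prediction at a new point x, represented here by its basis vector h(x).
h_new = rng.normal(size=M)
pred_primal = h_new @ beta
pred_kernel = (H @ h_new) @ alpha   # sum_i alpha_i K(x, x_i)
```

The two routes agree up to rounding, and in the kernel route the basis
functions enter only through inner products.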

Note, however, that this property depends on the choice of squared norm
$\|\beta\|^2$ in the penalty. It does not hold, for example, for the $L_1$ norm
$|\beta|$, which may lead to a superior model.

12.3.8 Discussion

The support vector machine can be extended to multiclass problems,
essentially by solving many two-class problems. A classifier is built for each
pair of classes, and the final classifier assigns the class that wins the most
pairwise comparisons (Kressel, 1999; Friedman, 1996; Hastie and Tibshirani,
1998). Alternatively, one could use the multinomial loss function along with a
suitable kernel, as in Section 12.3.3. SVMs have applications in many other
supervised and unsupervised learning problems. At the time of this writing,
empirical evidence suggests that they perform well in many real learning
problems.
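
The pairwise scheme amounts to a vote over all two-class classifiers. A minimal
sketch follows; the two-class rules here are toy nearest-center stand-ins
(hypothetical, not fitted SVMs), and ties are broken by whichever class the
counter encounters first.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Return the class that wins the most pairwise contests at x.

    pairwise maps each pair (a, b), in the order produced by
    combinations(classes, 2), to a function returning either a or b.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def nearest_center_rule(a, b, centers):
    """Toy stand-in for a fitted two-class classifier (hypothetical)."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


centers = {0: 0.0, 1: 5.0, 2: 10.0}
classes = sorted(centers)
pairwise = {(a, b): nearest_center_rule(a, b, centers)
            for a, b in combinations(classes, 2)}
label = one_vs_one_predict(6.0, classes, pairwise)
```

With $K$ classes this requires $K(K-1)/2$ two-class fits.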

Finally, we mention the connection of the support vector machine and
structural risk minimization (7.9). Suppose the training points (or their
basis expansion) are contained in a sphere of radius $R$, and let
$G(x) = \mathrm{sign}[f(x)] = \mathrm{sign}[\beta^T x + \beta_0]$ as in (12.2).
Then one can show that the class of functions $\{G(x), \|\beta\| \le A\}$ has
VC-dimension $h$ satisfying

$$h \le R^2 A^2. \tag{12.50}$$

If $f(x)$ separates the training data, optimally for $\|\beta\| \le A$, then
with probability at least $1 - \eta$ over training sets (Vapnik, 1996, page 139):

$$\mathrm{Error}_{\mathrm{Test}} \le 4\,\frac{h[\log(2N/h) + 1] - \log(\eta/4)}{N}. \tag{12.51}$$

The support vector classifier was one of the first practical learning
procedures for which useful bounds on the VC dimension could be obtained,
and hence the SRM program could be carried out. However, in the derivation,
balls are put around the data points, a process that depends on the
observed values of the features. Hence in a strict sense, the VC complexity
of the class is not fixed a priori, before seeing the features.

The regularization parameter C controls an upper bound on the VC 
dimension of the classifier. Following the SRM paradigm, we could choose C 
by minimizing the upper bound on the test error, given in (12.51). However, 
it is not clear that this has any advantage over the use of cross-validation 
for choice of C. 
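
To get a feel for the size of such a bound, the following sketch evaluates it
numerically. It assumes the bound has the form
$4\,\{h[\log(2N/h) + 1] - \log(\eta/4)\}/N$ with $h \le R^2 A^2$, which is our
reading of (12.50) and (12.51); the function names and the example values are
ours.

```python
import math


def vc_dimension_bound(R, A):
    """Upper bound R^2 * A^2 on the VC dimension, as in (12.50)."""
    return (R * A) ** 2


def srm_test_error_bound(h, N, eta):
    """Bound on test error holding with probability at least 1 - eta."""
    return 4.0 * (h * (math.log(2.0 * N / h) + 1.0) - math.log(eta / 4.0)) / N


# Illustrative values only: h = 10, N = 1000 training points, eta = 0.05.
bound = srm_test_error_bound(h=10, N=1000, eta=0.05)
```

For realistic $h$ and $N$ the bound is often loose, which is consistent with
the remark above that it has no clear advantage over cross-validation.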


12.4 Generalizing Linear Discriminant Analysis 

In Section 4.3 we discussed linear discriminant analysis (LDA), a
fundamental tool for classification. For the remainder of this chapter we
discuss a class of techniques that produce better classifiers than LDA by
directly generalizing LDA.

Some of the virtues of LDA are as follows: 

• It is a simple prototype classifier. A new observation is classified to the
class with closest centroid. A slight twist is that distance is measured
in the Mahalanobis metric, using a pooled covariance estimate.

• LDA is the estimated Bayes classifier if the observations are
multivariate Gaussian in each class, with a common covariance matrix.
Since this assumption is unlikely to be true, this might not seem to
be much of a virtue.

• The decision boundaries created by LDA are linear, leading to
decision rules that are simple to describe and implement.

• LDA provides natural low-dimensional views of the data. For example,
Figure 12.12 is an informative two-dimensional view of data in
256 dimensions with ten classes.

• Often LDA produces the best classification results, because of its
simplicity and low variance. LDA was among the top three classifiers
for 7 of the 22 datasets studied in the STATLOG project (Michie et
al., 1994).³

Unfortunately the simplicity of LDA causes it to fail in a number of
situations as well:

• Often linear decision boundaries do not adequately separate the classes. 
When N is large, it is possible to estimate more complex decision 
boundaries. Quadratic discriminant analysis (QDA) is often useful 
here, and allows for quadratic decision boundaries. More generally 
we would like to be able to model irregular decision boundaries. 

• The aforementioned shortcoming of LDA can often be paraphrased 
by saying that a single prototype per class is insufficient. LDA uses 
a single prototype (class centroid) plus a common covariance matrix 
to describe the spread of the data in each class. In many situations, 
several prototypes are more appropriate. 

• At the other end of the spectrum, we may have far too many (correlated)
predictors, for example, in the case of digitized analogue signals
and images. In this case LDA uses too many parameters, which are
estimated with high variance, and its performance suffers. In cases
such as this we need to restrict or regularize LDA even further.

In the remainder of this chapter we describe a class of techniques that 
attend to all these issues by generalizing the LDA model. This is achieved 
largely by three different ideas. 

The first idea is to recast the LDA problem as a linear regression problem.
Many techniques exist for generalizing linear regression to more flexible,
nonparametric forms of regression. This in turn leads to more flexible forms
of discriminant analysis, which we call FDA. In most cases of interest, the
regression procedures can be seen to identify an enlarged set of predictors
via basis expansions. FDA amounts to LDA in this enlarged space, the
same paradigm used in SVMs.

³ This study predated the emergence of SVMs.

In the case of too many predictors, such as the pixels of a digitized image, 
we do not want to expand the set: it is already too large. The second idea is 
to fit an LDA model, but penalize its coefficients to be smooth or otherwise 
coherent in the spatial domain, that is, as an image. We call this procedure 
penalized discriminant analysis or PDA. With FDA itself, the expanded 
basis set is often so large that regularization is also required (again as in 
SVMs). Both of these can be achieved via a suitably regularized regression 
in the context of the FDA model. 

The third idea is to model each class by a mixture of two or more Gaussians
with different centroids, but with every component Gaussian, both
within and between classes, sharing the same covariance matrix. This allows 
for more complex decision boundaries, and allows for subspace reduction 
as in LDA. We call this extension mixture discriminant analysis or MDA. 

All three of these generalizations use a common framework by exploiting 
their connection with LDA. 


12.5 Flexible Discriminant Analysis 

In this section we describe a method for performing LDA using linear
regression on derived responses. This in turn leads to nonparametric and
flexible alternatives to LDA. As in Chapter 4, we assume we have observations
with a qualitative response $G$ falling into one of $K$ classes
$\mathcal{G} = \{1, \ldots, K\}$, each having measured features $X$. Suppose
$\theta: \mathcal{G} \mapsto \mathbb{R}^1$ is a function that assigns scores to
the classes, such that the transformed class labels are optimally predicted by
linear regression on $X$. If our training sample has the form $(g_i, x_i)$,
$i = 1, 2, \ldots, N$, then we solve

$$\min_{\beta, \theta} \sum_{i=1}^N \left(\theta(g_i) - x_i^T \beta\right)^2, \tag{12.52}$$

with restrictions on $\theta$ to avoid a trivial solution (mean zero and unit
variance over the training data). This produces a one-dimensional separation
between the classes.

More generally, we can find up to $L \le K - 1$ sets of independent scorings
for the class labels, $\theta_1, \theta_2, \ldots, \theta_L$, and $L$
corresponding linear maps $\eta_\ell(X) = X^T \beta_\ell$, $\ell = 1, \ldots, L$,
chosen to be optimal for multiple regression in $\mathbb{R}^p$. The scores
$\theta_\ell(g)$ and the maps $\beta_\ell$ are chosen to minimize the average
squared residual,

$$\mathrm{ASR} = \frac{1}{N} \sum_{\ell=1}^L \left[ \sum_{i=1}^N \left(\theta_\ell(g_i) - x_i^T \beta_\ell\right)^2 \right]. \tag{12.53}$$

The set of scores are assumed to be mutually orthogonal and normalized
with respect to an appropriate inner product to prevent trivial zero
solutions.

Why are we going down this road? It can be shown that the sequence
of discriminant (canonical) vectors $\nu_\ell$ derived in Section 4.3.3 is
identical to the sequence $\beta_\ell$ up to a constant (Mardia et al., 1979;
Hastie et al., 1995). Moreover, the Mahalanobis distance of a test point $x$
to the $k$th class centroid $\hat\mu^k$ is given by

$$\delta_J(x, \hat\mu^k) = \sum_{\ell=1}^{K-1} w_\ell \left(\hat\eta_\ell(x) - \bar\eta_\ell^k\right)^2 + D(x), \tag{12.54}$$

where $\bar\eta_\ell^k$ is the mean of the $\hat\eta_\ell(x_i)$ in the $k$th
class, and $D(x)$ does not depend on $k$. Here $w_\ell$ are coordinate weights
that are defined in terms of the mean squared residual $r_\ell^2$ of the
$\ell$th optimally scored fit

$$w_\ell = \frac{1}{r_\ell^2 (1 - r_\ell^2)}. \tag{12.55}$$


In Section 4.3.2 we saw that these canonical distances are all that is needed 
for classification in the Gaussian setup, with equal covariances in each class. 
To summarize: 


LDA can be performed by a sequence of linear regressions, followed
by classification to the closest class centroid in the space
of fits. The analogy applies both to the reduced rank version,
or the full rank case when $L = K - 1$.


The real power of this result is in the generalizations that it invites. We
can replace the linear regression fits $\eta_\ell(x) = x^T \beta_\ell$ by far
more flexible, nonparametric fits, and by analogy achieve a more flexible
classifier than LDA. We have in mind generalized additive fits, spline
functions, MARS models and the like. In this more general form the regression
problems are defined via the criterion

$$\mathrm{ASR}(\{\theta_\ell, \eta_\ell\}_{\ell=1}^L) = \frac{1}{N} \sum_{\ell=1}^L \left[ \sum_{i=1}^N \left(\theta_\ell(g_i) - \eta_\ell(x_i)\right)^2 + \lambda J(\eta_\ell) \right], \tag{12.56}$$

where $J$ is a regularizer appropriate for some forms of nonparametric
regression, such as smoothing splines, additive splines and lower-order ANOVA
spline models. Also included are the classes of functions and associated
penalties generated by kernels, as in Section 12.3.3.

Before we describe the computations involved in this generalization, let
us consider a very simple example. Suppose we use degree-2 polynomial
regression for each $\eta_\ell$. The decision boundaries implied by (12.54)
will be quadratic surfaces, since each of the fitted functions is quadratic,
and as in LDA their squares cancel out when comparing distances. We could have
achieved identical quadratic boundaries in a more conventional way, by
augmenting our original predictors with their squares and cross-products.
In the enlarged space one performs an LDA, and the linear boundaries in
the enlarged space map down to quadratic boundaries in the original space.
A classic example is a pair of multivariate Gaussians centered at the origin,
one having covariance matrix $I$, and the other $cI$ for $c > 1$; Figure 12.9
illustrates. The Bayes decision boundary is the sphere
$\|x\| = \left(\frac{pc \log c}{c - 1}\right)^{1/2}$, which
is a linear boundary in the enlarged space.

FIGURE 12.9. The data consist of 50 points generated from each of $N(0, I)$ and
$N(0, \frac{9}{4} I)$. The solid black ellipse is the decision boundary found by
FDA using degree-two polynomial regression. The dashed purple circle is the
Bayes decision boundary.
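
The radius of the Bayes boundary follows from equating the two Gaussian log
densities (with equal priors): $\frac{p}{2}\log c = \frac{\|x\|^2}{2}(1 - 1/c)$.
The short check below confirms that the log density ratio vanishes on the
sphere $\|x\| = (pc \log c/(c-1))^{1/2}$. The function names and the example
values $p = 2$, $c = 2.25$ are illustrative choices of ours.

```python
import math


def bayes_radius(p, c):
    """Radius at which N(0, I) and N(0, cI) have equal density in p dimensions."""
    return math.sqrt(p * c * math.log(c) / (c - 1.0))


def log_density_ratio(r, p, c):
    """log N(x; 0, I) - log N(x; 0, cI) for any x with ||x|| = r."""
    return 0.5 * p * math.log(c) - 0.5 * r * r * (1.0 - 1.0 / c)


radius = bayes_radius(p=2, c=2.25)
at_boundary = log_density_ratio(radius, p=2, c=2.25)     # zero up to rounding
inside = log_density_ratio(0.5 * radius, p=2, c=2.25)    # favors N(0, I)
outside = log_density_ratio(2.0 * radius, p=2, c=2.25)   # favors N(0, cI)
```

Because the boundary depends on $x$ only through $\|x\|^2 = \sum_j x_j^2$, it
is linear in the squared predictors, which is the sense in which it is a linear
boundary in the enlarged space.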

Many nonparametric regression procedures operate by generating a basis 
expansion of derived variables, and then performing a linear regression in 
the enlarged space. The MARS procedure (Chapter 9) is exactly of this 
form. Smoothing splines and additive spline models generate an extremely 
large basis set (N x p basis functions for additive splines), but then perform 
a penalized regression fit in the enlarged space. SVMs do as well; see also 
the kernel-based regression example in Section 12.3.7. FDA in this case can 
be shown to perform a penalized linear discriminant analysis in the enlarged 
space. We elaborate in Section 12.6. Linear boundaries in the enlarged space 
map down to nonlinear boundaries in the reduced space. This is exactly the 
same paradigm that is used with support vector machines (Section 12.3). 

[Figure 12.10: two scatterplots titled "Linear Discriminant Analysis" and "Flexible Discriminant Analysis - Bruto"; axis label "Coordinate 1 for Training Data".]

FIGURE 12.10. The left plot shows the first two LDA canonical variates for
the vowel training data. The right plot shows the corresponding projection when
FDA/BRUTO is used to fit the model; plotted are the fitted regression functions
$\hat\eta_1(x_i)$ and $\hat\eta_2(x_i)$. Notice the improved separation. The
colors represent the eleven different vowel sounds.

We illustrate FDA on the speech recognition example used in Chapter 4,
with $K = 11$ classes and $p = 10$ predictors. The classes correspond to
11 vowel sounds, each contained in 11 different words. Here are the words,
preceded by the symbols that represent them:


  Vowel  Word     Vowel  Word     Vowel  Word     Vowel  Word
  i:     heed     O      hod      I      hid      C:     hoard
  E      head     U      hood     A      had      u:     who’d
  a:     hard     3:     heard    Y      hud




Each of eight speakers spoke each word six times in the training set, and
likewise seven speakers in the test set. The ten predictors are derived from
the digitized speech in a rather complicated way, but standard in the speech
recognition world. There are thus 528 training observations, and 462 test
observations. Figure 12.10 shows two-dimensional projections produced by
LDA and FDA. The FDA model used adaptive additive-spline regression
functions to model the $\hat\eta_\ell(x)$, and the points plotted in the right
plot have coordinates $\hat\eta_1(x_i)$ and $\hat\eta_2(x_i)$. The routine used
in S-PLUS is called bruto, hence the heading on the plot and in Table 12.3. We
see that flexible modeling has helped to separate the classes in this case.
Table 12.3 shows training and test error rates for a number of classification
techniques. FDA/MARS refers to Friedman’s multivariate adaptive regression
splines; degree = 2 means pairwise products are permitted. Notice that for
FDA/MARS, the
best classification results are obtained in a reduced-rank subspace.

TABLE 12.3. Vowel recognition data performance results. The results for neural
networks are the best among a much larger set, taken from a neural network
archive. The notation FDA/BRUTO refers to the regression method used with
FDA.

                                                     Error Rates
       Technique                                   Training   Test
  (1)  LDA                                           0.32     0.56
         Softmax                                     0.48     0.67
  (2)  QDA                                           0.01     0.53
  (3)  CART                                          0.05     0.56
  (4)  CART (linear combination splits)              0.05     0.54
  (5)  Single-layer perceptron                                0.67
  (6)  Multi-layer perceptron (88 hidden units)               0.49
  (7)  Gaussian node network (528 hidden units)               0.45
  (8)  Nearest neighbor                                       0.44
  (9)  FDA/BRUTO                                     0.06     0.44
         Softmax                                     0.11     0.50
  (10) FDA/MARS (degree = 1)                         0.09     0.45
         Best reduced dimension (=2)                 0.18     0.42
         Softmax                                     0.14     0.48
  (11) FDA/MARS (degree = 2)                         0.02     0.42
         Best reduced dimension (=6)                 0.13     0.39
         Softmax                                     0.10     0.50


12.5.1 Computing the FDA Estimates 

The computations for the FDA coordinates can be simplified in many
important cases, in particular when the nonparametric regression procedure
can be represented as a linear operator. We will denote this operator by
$S_\lambda$; that is, $\hat y = S_\lambda y$, where $y$ is the vector of
responses and $\hat y$ the vector of fits. Additive splines have this property,
if the smoothing parameters are fixed, as does MARS once the basis functions
are selected. The subscript $\lambda$ denotes the entire set of smoothing
parameters. In this case optimal scoring is equivalent to a canonical
correlation problem, and the solution can be computed by a single
eigen-decomposition. This is pursued in Exercise 12.6, and the resulting
algorithm is presented here.

We create an $N \times K$ indicator response matrix $Y$ from the responses
$g_i$, such that $y_{ik} = 1$ if $g_i = k$, otherwise $y_{ik} = 0$. For a
five-class problem $Y$ might look like the following:

              C1  C2  C3  C4  C5
  g1 = 2       0   1   0   0   0
  g2 = 1       1   0   0   0   0
  g3 = 1       1   0   0   0   0
  g4 = 5       0   0   0   0   1
  g5 = 4       0   0   0   1   0
    ...
  gN = 3       0   0   1   0   0

Here are the computational steps:


1. Multivariate nonparametric regression. Fit a multiresponse, adaptive
nonparametric regression of $Y$ on $X$, giving fitted values $\hat Y$. Let
$S_\lambda$ be the linear operator that fits the final chosen model, and
$\eta^*(x)$ be the vector of fitted regression functions.

2. Optimal scores. Compute the eigen-decomposition of
$Y^T \hat Y = Y^T S_\lambda Y$, where the eigenvectors $\Theta$ are normalized:
$\Theta^T D_\pi \Theta = I$. Here $D_\pi = Y^T Y / N$ is a diagonal matrix of
the estimated class prior probabilities.

3. Update the model from step 1 using the optimal scores:
$\eta(x) = \Theta^T \eta^*(x)$.


The first of the $K$ functions in $\eta(x)$ is the constant function, a trivial
solution; the remaining $K - 1$ functions are the discriminant functions. The
constant function, along with the normalization, causes all the remaining
functions to be centered.
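
The three steps can be sketched with NumPy for the simplest case, where the
fitting operator is ordinary linear regression with an intercept (so the
result corresponds to LDA). This is an illustrative sketch, not the fda
software mentioned below; the function name, the 0-based labels and the
division of $Y^T \hat Y$ by $N$ (which rescales eigenvalues but not
eigenvectors) are our own choices, and every class is assumed to be present.

```python
import numpy as np


def fda_optimal_scoring(X, g, K):
    """Optimal scoring with linear regression as the fitting operator.

    X: (N, p) features; g: length-N integer labels in {0, ..., K-1}.
    Returns (Theta, eigenvalues, eta), with eta evaluated at the training points.
    """
    X = np.asarray(X, dtype=float)
    g = np.asarray(g)
    N = X.shape[0]
    Y = np.eye(K)[g]                                # N x K indicator matrix
    X1 = np.column_stack([np.ones(N), X])           # add an intercept column

    # Step 1: multiresponse regression of Y on X, giving fitted values Y_hat.
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    Y_hat = X1 @ B

    # Step 2: eigen-decomposition of Y^T Y_hat / N, normalized so that
    # Theta^T D_pi Theta = I, where D_pi = Y^T Y / N is diagonal.
    pi_hat = Y.mean(axis=0)                         # diagonal of D_pi
    d = np.sqrt(pi_hat)
    M = Y.T @ Y_hat / N
    M = (M + M.T) / 2                               # symmetrize against round-off
    evals, U = np.linalg.eigh(M / np.outer(d, d))   # D^{-1/2} M D^{-1/2}
    order = np.argsort(evals)[::-1]                 # largest eigenvalue first
    evals, U = evals[order], U[:, order]
    Theta = U / d[:, None]

    # Step 3: eta(x) = Theta^T eta*(x), here at the training points.
    eta = Y_hat @ Theta
    return Theta, evals, eta


# Small synthetic example (hypothetical data): three shifted Gaussian classes.
rng = np.random.default_rng(1)
g = np.arange(60) % 3
X = rng.normal(size=(60, 4)) + g[:, None]
Theta, evals, eta = fda_optimal_scoring(X, g, K=3)
```

The first column of `eta` is constant, matching the trivial solution noted
above, and the remaining columns are the discriminant coordinates.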

Again $S_\lambda$ can correspond to any regression method. When
$S_\lambda = H_X$, the linear regression projection operator, then FDA is
linear discriminant analysis. The software that we reference in the
Computational Considerations section on page 455 makes good use of this
modularity; the fda function has a method= argument that allows one to supply
any regression function, as long as it follows some natural conventions. The
regression functions we provide allow for polynomial regression, adaptive
additive models and MARS. They all efficiently handle multiple responses, so
step (1) is a single call to a regression routine. The eigen-decomposition in
step (2) simultaneously computes all the optimal scoring functions.

In Section 4.2 we discussed the pitfalls of using linear regression on an
indicator response matrix as a method for classification. In particular,
severe masking can occur with three or more classes. FDA uses the fits from
such a regression in step (1), but then transforms them further to produce
useful discriminant functions that are devoid of these pitfalls. Exercise 12.9
takes another view of this phenomenon.


12.6 Penalized Discriminant Analysis


Although FDA is motivated by generalizing optimal scoring, it can also be
viewed directly as a form of regularized discriminant analysis. Suppose the
regression procedure used in FDA amounts to a linear regression onto a
basis expansion $h(X)$, with a quadratic penalty on the coefficients:

$$\mathrm{ASR}(\{\theta_\ell, \beta_\ell\}_{\ell=1}^L) = \frac{1}{N} \sum_{\ell=1}^L \left[ \sum_{i=1}^N \left(\theta_\ell(g_i) - h^T(x_i) \beta_\ell\right)^2 + \lambda \beta_\ell^T \Omega \beta_\ell \right]. \tag{12.57}$$

The choice of $\Omega$ depends on the problem. If
$\eta_\ell(x) = h^T(x) \beta_\ell$ is an expansion on spline basis functions,
$\Omega$ might constrain $\eta_\ell$ to be smooth over $\mathbb{R}^p$. In
the case of additive splines, there are $N$ spline basis functions for each
coordinate, resulting in a total of $Np$ basis functions in $h(x)$; $\Omega$ in
this case is $Np \times Np$ and block diagonal.

The steps in FDA can then be viewed as a generalized form of LDA, 
which we call penalized discriminant analysis, or PDA: 

• Enlarge the set of predictors $X$ via a basis expansion $h(X)$.

• Use (penalized) LDA in the enlarged space, where the penalized
Mahalanobis distance is given by

$$D(x, \mu) = (h(x) - h(\mu))^T (\Sigma_W + \lambda \Omega)^{-1} (h(x) - h(\mu)), \tag{12.58}$$

where $\Sigma_W$ is the within-class covariance matrix of the derived
variables $h(x_i)$.

• Decompose the classification subspace using a penalized metric:

$$\max u^T \Sigma_{\mathrm{Bet}} u \ \text{ subject to } \ u^T (\Sigma_W + \lambda \Omega) u = 1.$$


Loosely speaking, the penalized Mahalanobis distance tends to give less 
weight to “rough” coordinates, and more weight to “smooth” ones; since 
the penalty is not diagonal, the same applies to linear combinations that 
are rough or smooth. 
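
The effect of the penalty on the metric can be seen in a small NumPy sketch.
It takes the penalized distance to be
$d^T(\Sigma_W + \lambda\Omega)^{-1} d$ with $d = h(x) - h(\mu)$, and uses a
second-difference roughness penalty as one possible $\Omega$; that choice, the
identity within-class covariance, and all names below are our assumptions for
illustration.

```python
import numpy as np


def second_difference_penalty(m):
    """Omega = D^T D, where D is the (m-2) x m second-difference matrix."""
    D = np.diff(np.eye(m), n=2, axis=0)
    return D.T @ D


def penalized_mahalanobis(hx, hmu, sigma_w, omega, lam):
    """d^T (Sigma_W + lam * Omega)^{-1} d, with d = h(x) - h(mu)."""
    d = np.asarray(hx, dtype=float) - np.asarray(hmu, dtype=float)
    return float(d @ np.linalg.solve(sigma_w + lam * omega, d))


m = 6
omega = second_difference_penalty(m)
sigma_w = np.eye(m)                       # identity covariance, for illustration
origin = np.zeros(m)
smooth = np.ones(m)                       # constant direction: zero roughness
rough = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])   # alternating signs

d_smooth = penalized_mahalanobis(smooth, origin, sigma_w, omega, lam=10.0)
d_rough = penalized_mahalanobis(rough, origin, sigma_w, omega, lam=10.0)
d_rough_unpenalized = penalized_mahalanobis(rough, origin, sigma_w, omega, lam=0.0)
```

Both displacement vectors have the same Euclidean length, yet the rough one
contributes far less to the penalized distance than the smooth one.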

For some classes of problems, the first step, involving the basis expansion, 
is not needed; we already have far too many (correlated) predictors. A 
leading example is when the objects to be classified are digitized analog 
signals: 

• the log-periodogram of a fragment of spoken speech, sampled at a set 
of 256 frequencies; see Figure 5.5 on page 149. 

• the grayscale pixel values in a digitized image of a handwritten digit.

[Figure 12.11: image pairs titled "LDA: Coefficient 1" / "PDA: Coefficient 1", "LDA: Coefficient 2" / "PDA: Coefficient 2", "LDA: Coefficient 3" / "PDA: Coefficient 3", and so on.]

FIGURE 12.11. The images appear in pairs, and represent the nine
discriminant coefficient functions for the digit recognition problem. The left
member of each pair is the LDA coefficient, while the right member is the PDA
coefficient, regularized to enforce spatial smoothness.


It is also intuitively clear in these cases why regularization is needed.
Take the digitized image as an example. Neighboring pixel values will tend
to be correlated, being often almost the same. This implies that the pair
of corresponding LDA coefficients for these pixels can be wildly different
and opposite in sign, and thus cancel when applied to similar pixel values.
Positively correlated predictors lead to noisy, negatively correlated
coefficient estimates, and this noise results in unwanted sampling variance. A
reasonable strategy is to regularize the coefficients to be smooth over the
spatial domain, as with images. This is what PDA does. The computations
proceed just as for FDA, except that an appropriate penalized regression
method is used. Here $h^T(X) \beta_\ell = X \beta_\ell$, and $\Omega$ is chosen
so that $\beta_\ell^T \Omega \beta_\ell$ penalizes roughness in $\beta_\ell$
when viewed as an image. Figure 1.2 on page 4 shows some examples of
handwritten digits. Figure 12.11 shows the discriminant variates using LDA and
PDA. Those produced by LDA appear as salt-and-pepper images, while those
produced by PDA are smooth images. The first smooth image can be seen as the
coefficients of a linear contrast functional for separating images with a dark
central vertical strip (ones, possibly sevens) from images that are hollow in
the middle (zeros, some fours). Figure 12.12 supports this interpretation, and
with more difficulty allows an interpretation of the second coordinate. This
and other examples are discussed in more detail in Hastie et al. (1995), who
also show that the regularization improves the classification performance of
LDA on independent test data by a factor of around 25% in the cases they tried.

[Figure 12.12: scatterplot; horizontal axis "PDA: Discriminant Coordinate 1".]

FIGURE 12.12. The first two penalized canonical variates, evaluated for the
test data. The circles indicate the class centroids. The first coordinate
contrasts mainly 0’s and 1’s, while the second contrasts 6’s and 7/9’s.


12.7 Mixture Discriminant Analysis 


Linear discriminant analysis can be viewed as a prototype classifier. Each
class is represented by its centroid, and we classify to the closest using an
appropriate metric. In many situations a single prototype is not sufficient
to represent inhomogeneous classes, and mixture models are more appropriate.
In this section we review Gaussian mixture models and show how
they can be generalized via the FDA and PDA methods discussed earlier.
A Gaussian mixture model for the $k$th class has density

$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr} \phi(X; \mu_{kr}, \Sigma), \tag{12.59}$$

where the mixing proportions $\pi_{kr}$ sum to one. This has $R_k$ prototypes
for the $k$th class, and in our specification, the same covariance matrix
$\Sigma$ is used as the metric throughout. Given such a model for each class,
the class posterior probabilities are given by

$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k} \pi_{kr} \phi(X; \mu_{kr}, \Sigma) \Pi_k}{\sum_{\ell=1}^{K} \sum_{r=1}^{R_\ell} \pi_{\ell r} \phi(X; \mu_{\ell r}, \Sigma) \Pi_\ell}, \tag{12.60}$$

where $\Pi_k$ represent the class prior probabilities.

We saw these calculations for the special case of two components in 
Chapter 8. As in LDA, we estimate the parameters by maximum likelihood, 
using the joint log-likelihood based on $P(G, X)$:

$$\sum_{k=1}^{K} \sum_{g_i = k} \log \left[ \sum_{r=1}^{R_k} \pi_{kr} \phi(x_i; \mu_{kr}, \Sigma) \Pi_k \right]. \tag{12.61}$$


The sum within the log makes this a rather messy optimization problem
if tackled directly. The classical and natural method for computing the
maximum-likelihood estimates (MLEs) for mixture distributions is the EM
algorithm (Dempster et al., 1977), which is known to possess good
convergence properties. EM alternates between the two steps:

E-step: Given the current parameters, compute the responsibility of
subclass $c_{kr}$ within class $k$ for each of the class-$k$ observations
($g_i = k$):

$$W(c_{kr} \mid x_i, g_i) = \frac{\pi_{kr} \phi(x_i; \mu_{kr}, \Sigma)}{\sum_{\ell=1}^{R_k} \pi_{k\ell} \phi(x_i; \mu_{k\ell}, \Sigma)}. \tag{12.62}$$

M-step: Compute the weighted MLEs for the parameters of each of the
component Gaussians within each of the classes, using the weights
from the E-step.

In the E-step, the algorithm apportions the unit weight of an observation 
in class k to the various subclasses assigned to that class. If it is close to the 
centroid of a particular subclass, and far from the others, it will receive a 
mass close to one for that subclass. On the other hand, observations halfway 
between two subclasses will get approximately equal weight for both. 
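
A minimal sketch of the E-step for a single observation follows, restricted to
one dimension so that the shared covariance reduces to a common standard
deviation `sigma`. The responsibilities are taken to be proportional to
$\pi_{kr}\,\phi(x; \mu_{kr}, \Sigma)$, normalized over the subclasses of the
observation's own class; the function name and the toy values are ours, and the
mixing proportions are assumed strictly positive.

```python
import math


def responsibilities(x, pis, mus, sigma):
    """Responsibilities of the subclasses of class k for one class-k point x.

    pis, mus: mixing proportions and centroids of the R_k subclasses.
    The Gaussian normalizing constant is shared, so it cancels in the ratio.
    """
    logs = [math.log(pi) - 0.5 * ((x - mu) / sigma) ** 2
            for pi, mu in zip(pis, mus)]
    top = max(logs)                        # subtract the max for stability
    w = [math.exp(v - top) for v in logs]
    total = sum(w)
    return [wi / total for wi in w]


# Halfway between two equally weighted subclasses: equal responsibilities.
halfway = responsibilities(0.0, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigma=1.0)
# Sitting on one centroid, far from the other: nearly all weight on the first.
near_first = responsibilities(-3.0, pis=[0.5, 0.5], mus=[-3.0, 3.0], sigma=1.0)
```

Stacking these vectors for all observations in a class gives that class's
block of the weight matrix used in the M-step.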

In the M-step, an observation in class $k$ is used $R_k$ times, to estimate the
parameters in each of the $R_k$ component densities, with a different weight
for each. The EM algorithm is studied in detail in Chapter 8. The algorithm
requires initialization, which can have an impact, since mixture likelihoods
are generally multimodal. Our software (referenced in the Computational
Considerations on page 455) allows several strategies; here we describe the
default. The user supplies the number $R_k$ of subclasses per class. Within
class $k$, a $k$-means clustering model, with multiple random starts, is fitted
to the data. This partitions the observations into $R_k$ disjoint groups, from
which an initial weight matrix, consisting of zeros and ones, is created.

Our assumption of an equal component covariance matrix $\Sigma$ throughout
buys an additional simplicity; we can incorporate rank restrictions in the
mixture formulation just like in LDA. To understand this, we review a
little-known fact about LDA. The rank-$L$ LDA fit (Section 4.3.3) is
equivalent to the maximum-likelihood fit of a Gaussian model, where the
different mean vectors in each class are confined to a rank-$L$ subspace of
$\mathbb{R}^p$ (Exercise 4.8). We can inherit this property for the mixture
model, and maximize the log-likelihood (12.61) subject to rank constraints on
all the $\sum_k R_k$ centroids:

$$\operatorname{rank}\{\mu_{k\ell}\} = L.$$

Again the EM algorithm is available, and the M-step turns out to be
a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ “classes.”
Furthermore, we can use optimal scoring as before to solve the weighted LDA
problem, which allows us to use a weighted version of FDA or PDA at this stage.
One would expect, in addition to an increase in the number of “classes,” a
similar increase in the number of “observations” in the $k$th class by a factor
of $R_k$. It turns out that this is not the case if linear operators are used for
the optimal scoring regression. The enlarged indicator $Y$ matrix collapses
in this case to a blurred response matrix $Z$, which is intuitively pleasing.
For example, suppose there are $K = 3$ classes, and $R_k = 3$ subclasses per
class. Then $Z$ might be

[Display (12.63): an example $N \times 9$ blurred response matrix $Z$, with one
row per observation and one column per subclass $c_{kr}$.]

where the entries in a class-$k$ row correspond to $W(c_{kr} \mid x, g_i)$.
The remaining steps are the same:

$$\left.\begin{array}{l}
\hat{Z} = SZ \\
Z^T \hat{Z} = \Theta D \Theta^T \\
\text{Update the } \pi\text{s and } \Pi\text{s}
\end{array}\right\} \quad \text{M-step of MDA.}$$


These simple modifications add considerable flexibility to the mixture 
model: 


• The dimension reduction step in LDA, FDA or PDA is limited by 
the number of classes; in particular, for K = 2 classes no reduction is 
possible. MDA substitutes subclasses for classes, and then allows us 
to look at low-dimensional views of the subspace spanned by these 
subclass centroids. This subspace will often be an important one for 
discrimination. 


• By using FDA or PDA in the M-step, we can adapt even more to par¬ 
ticular situations. For example, we can fit MDA models to digitized 
analog signals and images, with smoothness constraints built in. 

Figure 12.13 compares FDA and MDA on the mixture example. 


12.7.1 Example: Waveform Data 

We now illustrate some of these ideas on a popular simulated example, 
taken from Breiman et al. (1984, pages 49-55), and used in Hastie and 
Tibshirani (1996b) and elsewhere. It is a three-class problem with 21 vari¬ 
ables, and is considered to be a difficult pattern recognition problem. The 
predictors are defined by 


$$\begin{aligned}
X_j &= U h_1(j) + (1 - U) h_2(j) + \epsilon_j && \text{Class 1,} \\
X_j &= U h_1(j) + (1 - U) h_3(j) + \epsilon_j && \text{Class 2,} \\
X_j &= U h_2(j) + (1 - U) h_3(j) + \epsilon_j && \text{Class 3,}
\end{aligned} \qquad (12.64)$$

where $j = 1, 2, \ldots, 21$, $U$ is uniform on (0,1), $\epsilon_j$ are standard
normal variates, and the $h_\ell$ are the shifted triangular waveforms:
$h_1(j) = \max(6 - |j - 11|, 0)$, $h_2(j) = h_1(j - 4)$ and
$h_3(j) = h_1(j + 4)$. Figure 12.14 shows some example waveforms from each
class.

[Figure 12.13 panel title: FDA / MARS - Degree 2]
FIGURE 12.13. FDA and MDA on the mixture data. The upper plot uses
FDA with MARS as the regression procedure. The lower plot uses MDA with
five mixture centers per class (indicated). The MDA solution is close to Bayes
optimal, as might be expected given the data arise from mixtures of Gaussians.
The broken purple curve in the background is the Bayes decision boundary.

[Figure 12.14 panel title: Class 1]
FIGURE 12.14. Some examples of the waveforms generated from model (12.64)
before the Gaussian noise is added.
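
To make the generating model (12.64) concrete, here is a minimal simulation
sketch. The function names, the seed and the use of equal class priors when
drawing labels are assumptions for illustration only.

```python
import random

def h1(j):
    """Triangular waveform of height 6 centered at j = 11."""
    return max(6 - abs(j - 11), 0)

def h2(j):
    return h1(j - 4)        # same triangle, peak shifted to j = 15

def h3(j):
    return h1(j + 4)        # same triangle, peak shifted to j = 7

# Each class mixes a different pair of the three base waveforms.
CLASS_PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}

def sample_waveform(cls, rng):
    """Draw one 21-dimensional observation from class cls (1, 2 or 3)."""
    ha, hb = CLASS_PAIRS[cls]
    u = rng.random()                       # one U ~ Uniform(0, 1) per observation
    return [u * ha(j) + (1 - u) * hb(j) + rng.gauss(0.0, 1.0)
            for j in range(1, 22)]

def sample_dataset(n, rng):
    """n observations, with labels drawn from equal priors (an assumption)."""
    labels = [rng.choice([1, 2, 3]) for _ in range(n)]
    return [sample_waveform(g, rng) for g in labels], labels

rng = random.Random(0)
X, g = sample_dataset(300, rng)
```

Note that a single draw of $U$ is shared by all 21 coordinates of an
observation, which is what places each noiseless waveform on the segment
joining two of the base waveforms.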

Table 12.4 shows the results of MDA applied to the waveform data, as 
well as several other methods from this and other chapters. Each train¬ 
ing sample has 300 observations, and equal priors were used, so there are 
roughly 100 observations in each class. We used test samples of size 500. 
The two MDA models are described in the caption. 

Figure 12.15 shows the leading canonical variates for the penalized MDA 
model, evaluated at the test data. As we might have guessed, the classes 
appear to lie on the edges of a triangle. This is because the $h_j(i)$ are
represented by three points in 21-space, thereby forming vertices of a triangle,
and each class is represented as a convex combination of a pair of vertices,
and hence lies on an edge. Also it is clear visually that all the information
lies in the first two dimensions; the percentage of variance explained by the 
first two coordinates is 99.8%, and we would lose nothing by truncating the 
solution there. The Bayes risk for this problem has been estimated to be 
about 0.14 (Breiman et al., 1984). MDA comes close to the optimal rate, 
which is not surprising since the structure of the MDA model is similar to 
the generating model. 
TABLE 12.4. Results for waveform data. The values are averages over ten
simulations, with the standard error of the average in parentheses. The five
entries above the line are taken from Hastie et al. (1994). The first model
below the line is MDA with three subclasses per class. The next line is the
same, except that the discriminant coefficients are penalized via a roughness
penalty to effectively 4 df. The third is the corresponding penalized LDA or
PDA model.

                                              Error Rates
Technique                                Training        Test
----------------------------------------------------------------------
LDA                                      0.121(0.006)    0.191(0.006)
QDA                                      0.039(0.004)    0.205(0.006)
CART                                     0.072(0.003)    0.289(0.004)
FDA/MARS (degree = 1)                    0.100(0.006)    0.191(0.006)
FDA/MARS (degree = 2)                    0.068(0.004)    0.215(0.002)
----------------------------------------------------------------------
MDA (3 subclasses)                       0.087(0.005)    0.169(0.006)
MDA (3 subclasses, penalized 4 df)       0.137(0.006)    0.157(0.005)
PDA (penalized 4 df)                     0.150(0.005)    0.171(0.005)

Bayes                                                    0.140


[Figure 12.15 panel titles: 3 Subclasses, Penalized 4 df (both panels); axis
labels include Discriminant Var 1 and Discriminant Var 3]
FIGURE 12.15. Some two-dimensional views of the MDA model fitted to a
sample of the waveform model. The points are independent test data, projected
on to the leading two canonical coordinates (left panel), and the third and fourth
(right panel). The subclass centers are indicated.
Computational Considerations

With $N$ training cases, $p$ predictors, and $m$ support vectors, the support
vector machine requires $m^3 + mN + mpN$ operations, assuming $m \approx N$.
Support vector machines do not scale well with $N$, although computational
shortcuts are available (Platt, 1999). Since these are evolving rapidly, the
reader is urged to search the web for the latest technology.

LDA requires $Np^2 + p^3$ operations, as does PDA. The complexity of
FDA depends on the regression method used. Many techniques are linear
in $N$, such as additive models and MARS. General splines and kernel-based
regression methods will typically require $N^3$ operations.

Software is available for fitting FDA, PDA and MDA models in the R 
package mda, which is also available in S-PLUS. 


Bibliographic Notes 

The theory behind support vector machines is due to Vapnik and is
described in Vapnik (1996). There is a burgeoning literature on SVMs; an
online bibliography, created and maintained by Alex Smola and Bernhard
Schölkopf, can be found at:

http://www.kernel-machines.org. 

Our treatment is based on Wahba et al. (2000) and Evgeniou et al. (2000), 
and the tutorial by Burges (Burges, 1998). 

Linear discriminant analysis is due to Fisher (1936) and Rao (1973). The 
connection with optimal scoring dates back at least to Breiman and Ihaka 
(1984), and in a simple form to Fisher (1936). There are strong connections 
with correspondence analysis (Greenacre, 1984). The description of flexible, 
penalized and mixture discriminant analysis is taken from Hastie et al. 
(1994), Hastie et al. (1995) and Hastie and Tibshirani (1996b), and all 
three are summarized in Hastie et al. (2000); see also Ripley (1996). 


Exercises 


Ex. 12.1 Show that the criteria (12.25) and (12.8) are equivalent. 

Ex. 12.2 Show that the solution to (12.29) is the same as the solution to 
(12.25) for a particular kernel. 

Ex. 12.3 Consider a modification to (12.43) where you do not penalize the 
constant. Formulate the problem, and characterize its solution. 

Ex. 12.4 Suppose you perform a reduced-subspace linear discriminant
analysis for a $K$-group problem. You compute the canonical variables of
dimension $L \le K - 1$ given by $z = U^T x$, where $U$ is the $p \times L$
matrix of discriminant coefficients, and $p > K$ is the dimension of $x$.

(a) If $L = K - 1$ show that

$$\|z - \bar{z}_k\|^2 - \|z - \bar{z}_{k'}\|^2 = \|x - \bar{x}_k\|_W^2 - \|x - \bar{x}_{k'}\|_W^2,$$

where $\|\cdot\|_W$ denotes Mahalanobis distance with respect to the
covariance $W$.

(b) If L < K — 1, show that the same expression on the left measures 
the difference in Mahalanobis squared distances for the distributions 
projected onto the subspace spanned by U. 

Ex. 12.5 The data in phoneme.subset, available from this book’s website 

http://www-stat.stanford.edu/ElemStatLearn 

consists of digitized log-periodograms for phonemes uttered by 60 speakers, 
each speaker having produced phonemes from each of five classes. It is 
appropriate to plot each vector of 256 “features” against the frequencies 
0-255. 

(a) Produce a separate plot of all the phoneme curves against frequency
for each class.

(b) You plan to use a nearest prototype classification scheme to classify
the curves into phoneme classes. In particular, you will use a K-means
clustering algorithm in each class (kmeans() in R), and then classify
observations to the class of the closest cluster center. The curves are
high-dimensional and you have a rather small sample-size-to-variables
ratio. You decide to restrict all the prototypes to be smooth functions
of frequency. In particular, you decide to represent each prototype $m$
as $m = B\theta$ where $B$ is a $256 \times J$ matrix of natural spline basis
functions with $J$ knots uniformly chosen in (0,255) and boundary
knots at 0 and 255. Describe how to proceed analytically, and in
particular, how to avoid costly high-dimensional fitting procedures.
(Hint: It may help to restrict $B$ to be orthogonal.)

(c) Implement your procedure on the phoneme data, and try it out. Divide
the data into a training set and a test set (50-50), making sure that
speakers are not split across sets (why?). Use $K = 1, 3, 5, 7$ centers
per class, and for each use $J = 5, 10, 15$ knots (taking care to start
the K-means procedure at the same starting values for each value of
$J$), and compare the results.

Ex. 12.6 Suppose that the regression procedure used in FDA (Section 12.5.1)
is a linear expansion of basis functions $h_m(x)$, $m = 1, \ldots, M$. Let
$D_\pi = Y^T Y / N$ be the diagonal matrix of class proportions.

(a) Show that the optimal scoring problem (12.52) can be written in vector
notation as

$$\min_{\theta, \beta} \|Y\theta - H\beta\|^2, \qquad (12.65)$$

where $\theta$ is a vector of $K$ real numbers, and $H$ is the $N \times M$
matrix of evaluations $h_j(x_i)$.

(b) Suppose that the normalization on $\theta$ is $\theta^T D_\pi 1 = 0$ and
$\theta^T D_\pi \theta = 1$. Interpret these normalizations in terms of the
original scored $\theta(g_i)$.

(c) Show that, with this normalization, (12.65) can be partially optimized
w.r.t. $\beta$, and leads to

$$\max_{\theta} \theta^T Y^T S Y \theta, \qquad (12.66)$$

subject to the normalization constraints, where $S$ is the projection
operator corresponding to the basis matrix $H$.

(d) Suppose that the $h_j$ include the constant function. Show that the
largest eigenvalue of $S$ is 1.

(e) Let $\Theta$ be a $K \times K$ matrix of scores (in columns), and suppose
the normalization is $\Theta^T D_\pi \Theta = I$. Show that the solution to
(12.53) is given by the complete set of eigenvectors of $S$; the first
eigenvector is trivial, and takes care of the centering of the scores. The
remainder characterize the optimal scoring solution.

Ex. 12.7 Derive the solution to the penalized optimal scoring problem 
(12.57). 

Ex. 12.8 Show that coefficients found by optimal scoring are proportional
to the discriminant directions $\nu_\ell$ found by linear discriminant analysis.

Ex. 12.9 Let $\hat{Y} = X\hat{B}$ be the fitted $N \times K$ indicator response
matrix after linear regression on the $N \times p$ matrix $X$, where $p > K$.
Consider the reduced features $x_i^* = \hat{B}^T x_i$. Show that LDA using
$x_i^*$ is equivalent to LDA in the original space.

Ex. 12.10 Kernels and linear discriminant analysis. Suppose you wish to
carry out a linear discriminant analysis (two classes) using a vector of
transformations of the input variables $h(x)$. Since $h(x)$ is high-dimensional,
you will use a regularized within-class covariance matrix $W_h + \gamma I$. Show
that the model can be estimated using only the inner products
$K(x_i, x_{i'}) = \langle h(x_i), h(x_{i'}) \rangle$. Hence the kernel property
of support vector machines is also shared by regularized linear discriminant
analysis.

Ex. 12.11 The MDA procedure models each class as a mixture of Gaussians.
Hence each mixture center belongs to one and only one class. A more
general model allows each mixture center to be shared by all classes. We
take the joint density of labels and features to be

$$P(G, X) = \sum_{r=1}^{R} \pi_r P_r(G, X), \qquad (12.67)$$

a mixture of joint densities. Furthermore we assume

$$P_r(G, X) = P_r(G)\,\phi(X; \mu_r, \Sigma). \qquad (12.68)$$

This model consists of regions centered at $\mu_r$, and for each there is a
class profile $P_r(G)$. The posterior class distribution is given by

$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R} \pi_r P_r(G = k)\,\phi(x; \mu_r, \Sigma)}{\sum_{r=1}^{R} \pi_r\,\phi(x; \mu_r, \Sigma)}, \qquad (12.69)$$

where the denominator is the marginal distribution $P(X)$.

(a) Show that this model (called MDA2) can be viewed as a generalization
of MDA since

$$P(X \mid G = k) = \frac{\sum_{r=1}^{R} \pi_r P_r(G = k)\,\phi(x; \mu_r, \Sigma)}{\sum_{r=1}^{R} \pi_r P_r(G = k)}, \qquad (12.70)$$

where $\pi_{rk} = \pi_r P_r(G = k) / \sum_{r=1}^{R} \pi_r P_r(G = k)$ corresponds
to the mixing proportions for the $k$th class.

(b) Derive the EM algorithm for MDA2.

(c) Show that if the initial weight matrix is constructed as in MDA,
involving separate K-means clustering in each class, then the algorithm
for MDA2 is identical to the original MDA procedure.
13

Prototype Methods and
Nearest-Neighbors


13.1 Introduction 

In this chapter we discuss some simple and essentially model-free methods 
for classification and pattern recognition. Because they are highly unstruc¬ 
tured, they typically are not useful for understanding the nature of the 
relationship between the features and class outcome. However, as black box 
prediction engines, they can be very effective, and are often among the best 
performers in real data problems. The nearest-neighbor technique can also 
be used in regression; this was touched on in Chapter 2 and works reason¬ 
ably well for low-dimensional problems. However, with high-dimensional 
features, the bias-variance tradeoff does not work as favorably for nearest- 
neighbor regression as it does for classification. 


13.2 Prototype Methods 

Throughout this chapter, our training data consists of the $N$ pairs
$(x_1, g_1), \ldots, (x_N, g_N)$ where $g_i$ is a class label taking values in
$\{1, 2, \ldots, K\}$. Prototype methods represent the training data by a set of
points in feature space. These prototypes are typically not examples from the
training sample, except in the case of 1-nearest-neighbor classification
discussed later.

Each prototype has an associated class label, and classification of a query 
point x is made to the class of the closest prototype. “Closest” is usually 
defined by Euclidean distance in the feature space, after each feature has 
been standardized to have overall mean 0 and variance 1 in the training
sample. Euclidean distance is appropriate for quantitative features. We 
discuss distance measures between qualitative and other kinds of feature 
values in Chapter 14. 

These methods can be very effective if the prototypes are well positioned 
to capture the distribution of each class. Irregular class boundaries can be 
represented, with enough prototypes in the right places in feature space. 
The main challenge is to figure out how many prototypes to use and where 
to put them. Methods differ according to the number and way in which 
prototypes are selected. 

13.2.1 K-means Clustering 

K-means clustering is a method for finding clusters and cluster centers in a
set of unlabeled data. One chooses the desired number of cluster centers, say
R, and the K-means procedure iteratively moves the centers to minimize
the total within cluster variance.¹ Given an initial set of centers, the
K-means algorithm alternates the two steps:

• for each center we identify the subset of training points (its cluster) 
that is closer to it than any other center; 

• the means of each feature for the data points in each cluster are 
computed, and this mean vector becomes the new center for that 
cluster. 

These two steps are iterated until convergence. Typically the initial centers
are R randomly chosen observations from the training data. Details of the
K-means procedure, as well as generalizations allowing for different variable
types and more general distance measures, are given in Chapter 14.

To use K-means clustering for classification of labeled data, the steps
are:

• apply K-means clustering to the training data in each class separately,
using R prototypes per class;

• assign a class label to each of the K x R prototypes; 

• classify a new feature x to the class of the closest prototype. 
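
The three steps above can be sketched as follows. This is a minimal
illustration, not a tuned implementation: the function names, the toy data,
the fixed iteration cap and the handling of empty clusters are all assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, R, rng, n_iter=50):
    """K-means with R centers, started at R randomly chosen observations."""
    centers = [list(p) for p in rng.sample(points, R)]
    for _ in range(n_iter):
        # Step 1: assign every point to its closest center.
        clusters = [[] for _ in range(R)]
        for p in points:
            nearest = min(range(R), key=lambda r: sq_dist(p, centers[r]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of its cluster.
        new_centers = []
        for r in range(R):
            if clusters[r]:
                dim = len(clusters[r][0])
                new_centers.append([sum(p[d] for p in clusters[r]) / len(clusters[r])
                                    for d in range(dim)])
            else:
                new_centers.append(centers[r])   # assumption: keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def fit_prototypes(X, g, R, rng):
    """Run K-means separately in each class; return (prototype, label) pairs."""
    protos = []
    for k in sorted(set(g)):
        class_points = [x for x, gi in zip(X, g) if gi == k]
        protos.extend((c, k) for c in kmeans(class_points, R, rng))
    return protos

def classify(x, protos):
    """Assign x to the class of the closest prototype."""
    return min(protos, key=lambda pk: sq_dist(x, pk[0]))[1]

# Toy two-class data (hypothetical): well-separated Gaussian blobs.
rng = random.Random(1)
X = ([[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(30)] +
     [[rng.gauss(3, 0.3), rng.gauss(3, 0.3)] for _ in range(30)])
g = [1] * 30 + [2] * 30
protos = fit_prototypes(X, g, R=3, rng=rng)
```

Because each call to `kmeans` sees only one class, the prototypes of a class
are placed without regard to where the other classes lie.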

Figure 13.1 (upper panel) shows a simulated example with three classes
and two features. We used R = 5 prototypes per class, and show the
classification regions and the decision boundary. Notice that a number of the
prototypes are near the class boundaries, leading to potential
misclassification errors for points near these boundaries. This results from an
obvious shortcoming with this method: for each class, the other classes do not
have a say in the positioning of the prototypes for that class. A better
approach, discussed next, uses all of the data to position all prototypes.

¹ The “K” in K-means refers to the number of cluster centers. Since we have
already reserved K to denote the number of classes, we denote the number of
clusters by R.

[Figure 13.1 panel titles: K-means - 5 Prototypes per Class (upper); LVQ - 5
Prototypes per Class (lower)]
FIGURE 13.1. Simulated example with three classes and five prototypes per
class. The data in each class are generated from a mixture of Gaussians. In the
upper panel, the prototypes were found by applying the K-means clustering
algorithm separately in each class. In the lower panel, the LVQ algorithm
(starting from the K-means solution) moves the prototypes away from the decision
boundary. The broken purple curve in the background is the Bayes decision
boundary.

Algorithm 13.1 Learning Vector Quantization (LVQ).

1. Choose R initial prototypes for each class: $m_1(k), m_2(k), \ldots, m_R(k)$,
$k = 1, 2, \ldots, K$, for example, by sampling R training points at random
from each class.

2. Sample a training point $x_i$ randomly (with replacement), and let $(j, k)$
index the closest prototype $m_j(k)$ to $x_i$.

(a) If $g_i = k$ (i.e., they are in the same class), move the prototype
towards the training point:

$$m_j(k) \leftarrow m_j(k) + \epsilon\,(x_i - m_j(k)),$$

where $\epsilon$ is the learning rate.

(b) If $g_i \ne k$ (i.e., they are in different classes), move the prototype
away from the training point:

$$m_j(k) \leftarrow m_j(k) - \epsilon\,(x_i - m_j(k)).$$

3. Repeat step 2, decreasing the learning rate $\epsilon$ with each iteration
towards zero.


13.2.2 Learning Vector Quantization 

In this technique due to Kohonen (1989), prototypes are placed strategically 
with respect to the decision boundaries in an ad-hoc way. LVQ is an online 
algorithm—observations are processed one at a time. 

The idea is that the training points attract prototypes of the correct class,
and repel other prototypes. When the iterations settle down, prototypes
should be close to the training points in their class. The learning rate
$\epsilon$ is decreased to zero with each iteration, following the guidelines
for stochastic approximation learning rates (Section 11.4).
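
A minimal sketch of the LVQ update of Algorithm 13.1 is given below. The
learning-rate schedule $\epsilon_t = \epsilon_0 / (1 + t/\tau)$, the number of
steps, the function names and the toy data are assumptions chosen for
illustration; other decreasing schedules could equally be used.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lvq1(X, g, protos, rng, n_steps=2000, eps0=0.1, tau=100.0):
    """LVQ updates. protos is a list of [vector, class_label] pairs, updated in place."""
    for t in range(n_steps):
        eps = eps0 / (1.0 + t / tau)          # assumed schedule, decreasing toward zero
        i = rng.randrange(len(X))             # sample a training point with replacement
        x, gi = X[i], g[i]
        m, k = min(protos, key=lambda p: sq_dist(x, p[0]))   # closest prototype
        sign = 1.0 if gi == k else -1.0       # attract if same class, repel otherwise
        for d in range(len(m)):
            m[d] += sign * eps * (x[d] - m[d])
    return protos

# Toy one-dimensional example (hypothetical): class 1 near 0, class 2 near 4,
# with both prototypes starting close to the boundary between the classes.
X = [[0.0], [0.2], [-0.2], [4.0], [3.8], [4.2]]
g = [1, 1, 1, 2, 2, 2]
protos = [[[1.5], 1], [[2.5], 2]]
lvq1(X, g, protos, random.Random(0))
```

In this toy run each training point is always closest to the prototype of its
own class, so only the attracting update fires and the prototypes drift from
their starting values toward the class centers.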

Figure 13.1 (lower panel) shows the result of LVQ, using the K-means
solution as starting values. The prototypes have tended to move away from
the decision boundaries, and away from prototypes of competing classes.

The procedure just described is actually called LVQ1. Modifications
(LVQ2, LVQ3, etc.) have been proposed that can sometimes improve
performance. A drawback of learning vector quantization methods is the fact
that they are defined by algorithms, rather than optimization of some fixed
criteria; this makes it difficult to understand their properties.


13.2.3 Gaussian Mixtures 

The Gaussian mixture model can also be thought of as a prototype method,
similar in spirit to K-means and LVQ. We discuss Gaussian mixtures in
some detail in Sections 6.8, 8.5 and 12.7. Each cluster is described in terms
of a Gaussian density, which has a centroid (as in K-means), and a
covariance matrix. The comparison becomes crisper if we restrict the component
Gaussians to have a scalar covariance matrix (Exercise 13.1). The two steps
of the alternating EM algorithm are very similar to the two steps in
K-means:

• In the E-step, each observation is assigned a responsibility or weight 
for each cluster, based on the likelihood of each of the correspond¬ 
ing Gaussians. Observations close to the center of a cluster will most 
likely get weight 1 for that cluster, and weight 0 for every other clus¬ 
ter. Observations half-way between two clusters divide their weight 
accordingly. 

• In the M-step, each observation contributes to the weighted means 
(and covariances) for every cluster. 

As a consequence, the Gaussian mixture model is often referred to as a soft
clustering method, while K-means is hard.
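
The contrast between soft and hard assignment can be seen numerically. The
sketch below assumes equal mixing proportions and a scalar covariance
$\sigma^2 I$ for every component; the function names and the toy centers are
illustrative assumptions.

```python
import math

def sq_dists(x, centers):
    """Squared Euclidean distances from x to each center."""
    return [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]

def responsibilities(x, centers, sigma2):
    """E-step weights for equal-proportion Gaussians with covariance sigma2 * I."""
    d2 = sq_dists(x, centers)
    d2_min = min(d2)                  # subtract the minimum for numerical stability
    w = [math.exp(-(d - d2_min) / (2.0 * sigma2)) for d in d2]
    total = sum(w)
    return [wi / total for wi in w]

def hard_assignment(x, centers):
    """K-means style assignment: all weight on the closest center."""
    d2 = sq_dists(x, centers)
    nearest = d2.index(min(d2))
    return [1.0 if r == nearest else 0.0 for r in range(len(centers))]

centers = [[0.0, 0.0], [2.0, 0.0]]
x = [0.8, 0.0]

soft_wide = responsibilities(x, centers, sigma2=1.0)     # weight is shared
soft_narrow = responsibilities(x, centers, sigma2=0.01)  # nearly all on the closer center
hard = hard_assignment(x, centers)
halfway = responsibilities([1.0, 0.0], centers, sigma2=1.0)
```

With the larger variance the point shares its weight between the two clusters;
as the variance shrinks the soft weights approach the hard assignment.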

Similarly, when Gaussian mixture models are used to represent the
feature density in each class, they produce smooth posterior probabilities
$\hat{p}(x) = \{\hat{p}_1(x), \ldots, \hat{p}_K(x)\}$ for classifying $x$ (see
(12.60) on page 449). Often this is interpreted as a soft classification, while
in fact the classification rule is $\hat{G}(x) = \arg\max_k \hat{p}_k(x)$.
Figure 13.2 compares the results of K-means and Gaussian mixtures on the
simulated mixture problem of Chapter 2. We see that although the decision
boundaries are roughly similar, those for the mixture model are smoother
(although the prototypes are in approximately the same positions). We also see
that while both procedures devote a blue prototype (incorrectly) to a region in
the northwest, the Gaussian mixture classifier can ultimately ignore this
region, while K-means cannot. LVQ gave very similar results to K-means on this
example, and is not shown.


[Figure 13.2 panel titles: K-means - 5 Prototypes per Class (upper); Gaussian
Mixtures - 5 Subclasses per Class (lower)]
FIGURE 13.2. The upper panel shows the K-means classifier applied to the
mixture data example. The decision boundary is piecewise linear. The lower panel
shows a Gaussian mixture model with a common covariance for all component
Gaussians. The EM algorithm for the mixture model was started at the K-means
solution. The broken purple curve in the background is the Bayes decision
boundary.


13.3 k-Nearest-Neighbor Classifiers

These classifiers are memory-based, and require no model to be fit. Given
a query point $x_0$, we find the $k$ training points $x_{(r)}$, $r = 1, \ldots, k$
closest in distance to $x_0$, and then classify using majority vote among the
$k$ neighbors. Ties are broken at random. For simplicity we will assume that
the features are real-valued, and we use Euclidean distance in feature space:

$$d_{(i)} = \|x_{(i)} - x_0\|. \qquad (13.1)$$

Typically we first standardize each of the features to have mean zero and 
variance 1, since it is possible that they are measured in different units. In 
Chapter 14 we discuss distance measures appropriate for qualitative and 
ordinal features, and how to combine them for mixed data. Adaptively 
chosen distance metrics are discussed later in this chapter. 
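
A minimal sketch of the procedure follows: standardize the features using the
training sample, find the $k$ closest training points, and take a majority
vote with random tie-breaking. The function names and the toy data are
assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
import random
from collections import Counter

def standardize(X):
    """Scale each feature to mean 0 and variance 1 using the training sample.

    Returns the scaled training data and a function that applies the same
    scaling to a new query point.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    # A constant feature has standard deviation 0; fall back to 1.0 in that case.
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
           for j in range(p)]

    def transform(x):
        return [(x[j] - means[j]) / sds[j] for j in range(p)]

    return [transform(row) for row in X], transform

def knn_classify(x0, X, g, k, rng):
    """Majority vote among the k training points closest to x0 (Euclidean)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x0))
    votes = Counter(g[i] for i in order[:k])
    top = max(votes.values())
    winners = sorted(c for c, v in votes.items() if v == top)
    return rng.choice(winners)              # ties are broken at random

# Toy data (hypothetical): two features measured on very different scales.
X = [[1.0, 100.0], [1.2, 300.0], [0.9, 200.0],
     [3.0, 150.0], [3.2, 250.0], [2.9, 50.0]]
g = ["a", "a", "a", "b", "b", "b"]

X_std, transform = standardize(X)
rng = random.Random(0)
pred_a = knn_classify(transform([1.1, 250.0]), X_std, g, k=3, rng=rng)
pred_b = knn_classify(transform([3.1, 100.0]), X_std, g, k=3, rng=rng)
```

The query point must be transformed with the training means and standard
deviations, not with statistics computed from the query itself.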

Despite its simplicity, k-nearest-neighbors has been successful in a large
number of classification problems, including handwritten digits, satellite 
image scenes and EKG patterns. It is often successful where each class 
has many possible prototypes, and the decision boundary is very irregular. 
Figure 13.3 (upper panel) shows the decision boundary of a 15-nearest- 
neighbor classifier applied to the three-class simulated data. The decision 
boundary is fairly smooth compared to the lower panel, where a 1-nearest- 
neighbor classifier was used. There is a close relationship between nearest- 
neighbor and prototype methods: in 1-nearest-neighbor classification, each 
training point is a prototype. 

Figure 13.4 shows the training, test and tenfold cross-validation errors 
as a function of the neighborhood size, for the two-class mixture problem. 
Since the tenfold CV errors are averages of ten numbers, we can estimate 
a standard error. 

Because it uses only the training point closest to the query point, the bias 
of the 1-nearest-neighbor estimate is often low, but the variance is high. 
A famous result of Cover and Hart (1967) shows that asymptotically the 
error rate of the 1-nearest-neighbor classifier is never more than twice the 
Bayes rate. The rough idea of the proof is as follows (using squared-error 
loss). We assume that the query point coincides with one of the training 
points, so that the bias is zero. This is true asymptotically if the dimension 
of the feature space is fixed and the training data fills up the space in a 
dense fashion. Then the error of the Bayes rule is just the variance of a 
Bernoulli random variate (the target at the query point), while the error of 
1-nearest-neighbor rule is twice the variance of a Bernoulli random variate, 
one contribution each for the training and query targets. 

[Figure 13.3 panel title: 15-Nearest Neighbors]
FIGURE 13.3. k-nearest-neighbor classifiers applied to the simulation data of
Figure 13.1. The broken purple curve in the background is the Bayes decision
boundary.

[Figure 13.4 axis label: Number of Neighbors; panel title: 7-Nearest Neighbors]
FIGURE 13.4. k-nearest-neighbors on the two-class mixture data. The upper
panel shows the misclassification errors as a function of neighborhood size.
Standard error bars are included for 10-fold cross validation. The lower panel
shows the decision boundary for 7-nearest-neighbors, which appears to be optimal
for minimizing test error. The broken purple curve in the background is the
Bayes decision boundary.

We now give more detail for misclassification loss. At $x$ let $k^*$ be the
dominant class, and $p_k(x)$ the true conditional probability for class $k$.
Then

$$\text{Bayes error} = 1 - p_{k^*}(x), \qquad (13.2)$$

$$\text{1-nearest-neighbor error} = \sum_{k=1}^{K} p_k(x)\,(1 - p_k(x)), \qquad (13.3)$$

$$\ge 1 - p_{k^*}(x). \qquad (13.4)$$

The asymptotic 1-nearest-neighbor error rate is that of a random rule; we
pick both the classification and the test point at random with probabilities
$p_k(x)$, $k = 1, \ldots, K$. For $K = 2$ the 1-nearest-neighbor error rate is
$2p_{k^*}(x)(1 - p_{k^*}(x)) \le 2(1 - p_{k^*}(x))$ (twice the Bayes error
rate). More generally, one can show (Exercise 13.3)

$$\sum_{k=1}^{K} p_k(x)\,(1 - p_k(x)) \le 2(1 - p_{k^*}(x)) - \frac{K}{K-1}\,(1 - p_{k^*}(x))^2. \qquad (13.5)$$

Many additional results of this kind have been derived; Ripley (1996)
summarizes a number of them.
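
The quantities in (13.2)-(13.5) are easy to check numerically at a single
point $x$. In the sketch below the class probability vectors are made-up
examples and the function names are assumptions.

```python
def bayes_error(p):
    """Pointwise Bayes error (13.2): one minus the largest class probability."""
    return 1.0 - max(p)

def one_nn_error(p):
    """Asymptotic pointwise 1-nearest-neighbor error (13.3)."""
    return sum(pk * (1.0 - pk) for pk in p)

def upper_bound(p):
    """Right-hand side of (13.5)."""
    K = len(p)
    e = bayes_error(p)
    return 2.0 * e - K / (K - 1) * e ** 2

p3 = [0.7, 0.2, 0.1]       # hypothetical three-class probabilities at x
p2 = [0.8, 0.2]            # hypothetical two-class probabilities at x

errors3 = (bayes_error(p3), one_nn_error(p3), upper_bound(p3))
errors2 = (bayes_error(p2), one_nn_error(p2), upper_bound(p2))
```

For `p3` the three values are approximately 0.30, 0.46 and 0.465, so the
1-nearest-neighbor error lies between the Bayes error and the bound. For `p2`
the 1-nearest-neighbor error and the bound are both approximately 0.32: with
two classes the bound (13.5) holds with equality.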

This result can provide a rough idea about the best performance that 
is possible in a given problem. For example, if the 1-nearest-neighbor rule 
has a 10% error rate, then asymptotically the Bayes error rate is at least 
5%. The kicker here is the asymptotic part, which assumes the bias of the 
nearest-neighbor rule is zero. In real problems the bias can be substantial. 
The adaptive nearest-neighbor rules, described later in this chapter, are an 
attempt to alleviate this bias. For simple nearest-neighbors, the bias and 
variance characteristics can dictate the optimal number of near neighbors 
for a given problem. This is illustrated in the next example. 


13.3.1 Example: A Comparative Study 

We tested the nearest-neighbors, K-means and LVQ classifiers on two
simulated problems. There are 10 independent features $X_j$, each uniformly
distributed on [0,1]. The two-class 0-1 target variable is defined as follows:

$$\begin{aligned}
Y &= I\left(X_1 > \tfrac{1}{2}\right); && \text{problem 1: “easy”,} \\
Y &= I\left(\operatorname{sign}\left\{\prod_{j=1}^{3}\left(X_j - \tfrac{1}{2}\right)\right\} > 0\right); && \text{problem 2: “difficult.”}
\end{aligned} \qquad (13.6)$$

Hence in the first problem the two classes are separated by the hyperplane
$X_1 = 1/2$; in the second problem, the two classes form a checkerboard
pattern in the hypercube defined by the first three features. The Bayes
error rate is zero in both problems. There were 100 training and 1000 test
observations.
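
The two simulated problems can be generated along the following lines. The
function names and the seed are assumptions, and the labeling convention
(class 1 where $X_1 > 1/2$ in the first problem, and where the product of the
first three centered features is positive in the second) is our reading of the
description above.

```python
import random

def easy_label(x):
    """Problem 1: classes separated by the hyperplane X_1 = 1/2."""
    return 1 if x[0] > 0.5 else 0

def difficult_label(x):
    """Problem 2: checkerboard pattern in the first three features."""
    prod = (x[0] - 0.5) * (x[1] - 0.5) * (x[2] - 0.5)
    return 1 if prod > 0 else 0

def sample_problem(n, label_fn, rng, p=10):
    """n observations with p independent Uniform[0, 1] features and 0-1 targets."""
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [label_fn(x) for x in X]
    return X, y

rng = random.Random(0)
X_train, y_train = sample_problem(100, easy_label, rng)
X_test, y_test = sample_problem(1000, difficult_label, rng)
```

In both problems the label is a deterministic function of the features, which
is why the Bayes error rate is zero; the remaining features are pure noise as
far as the class is concerned.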

Figure 13.5 shows the mean and standard error of the misclassification
error for nearest-neighbors, K-means and LVQ over ten realizations, as
the tuning parameters are varied. We see that K-means and LVQ give
nearly identical results. For the best choices of their tuning parameters,
K-means and LVQ outperform nearest-neighbors for the first problem, and
they perform similarly for the second problem. Notice that the best value
of each tuning parameter is clearly situation dependent. For example
25-nearest-neighbors outperforms 1-nearest-neighbor by a factor of 70% in the
first problem, while 1-nearest-neighbor is best in the second problem by a
factor of 18%. These results underline the importance of using an objective,
data-based method like cross-validation to estimate the best value of a
tuning parameter (see Figure 13.4 and Chapter 7).

[Figure 13.5 panel titles: Nearest Neighbors / Easy; K-means & LVQ / Easy; axis
labels: Number of Neighbors; Number of Prototypes per Class]
FIGURE 13.5. Mean ± one standard error of misclassification error for
nearest-neighbors, K-means (blue) and LVQ (red) over ten realizations for two
simulated problems: “easy” and “difficult,” described in the text.

FIGURE 13.6. The first four panels are LANDSAT images for an agricultural
area in four spectral bands, depicted by heatmap shading. The remaining two
panels give the actual land usage (color coded) and the predicted land usage using
a five-nearest-neighbor rule described in the text.


13.3.2 Example: k-Nearest-Neighbors and Image Scene 
Classification 

The STATLOG project (Michie et al., 1994) used part of a LANDSAT
image as a benchmark for classification (82 x 100 pixels). Figure 13.6 shows
four heat-map images, two in the visible spectrum and two in the infrared,
for an area of agricultural land in Australia. Each pixel has a class label
from the 7-element set $\mathcal{G}$ = {red soil, cotton, vegetation stubble,
mixture, gray soil, damp gray soil, very damp gray soil}, determined manually by
research assistants surveying the area. The lower middle panel shows the
actual land usage, shaded by different colors to indicate the classes. The 
objective is to classify the land usage at a pixel, based on the information 
in the four spectral bands. 

Five-nearest-neighbors produced the predicted map shown in the bot¬ 
tom right panel, and was computed as follows. For each pixel we extracted 
an 8-neighbor feature map—the pixel itself and its 8 immediate neighbors 





















13.3 fc-Nearest-Neighbor Classifiers 471 






X 






FIGURE 13.7. A pixel and its 8-neighbor feature map. 


(see Figure 13.7). This is done separately in the four spectral bands, giving
(1 + 8) × 4 = 36 input features per pixel. Then five-nearest-neighbors
classification was carried out in this 36-dimensional feature space. The resulting
test error rate was about 9.5% (see Figure 13.8). Of all the methods used
in the STATLOG project, including LVQ, CART, neural networks, linear
discriminant analysis and many others, k-nearest-neighbors performed best
on this task. Hence it is likely that the decision boundaries in $\mathbb{R}^{36}$ are quite
irregular.
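
The feature construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scene is a synthetic random array standing in for the LANDSAT data, the function name `neighbor_features` is hypothetical, and border pixels are handled by edge replication, which the text does not specify.

```python
import numpy as np

def neighbor_features(image):
    """Stack each pixel with its 8 immediate neighbors in every band.

    image: array of shape (H, W, B), one slice per spectral band.
    Returns an array of shape (H * W, 9 * B).
    Border pixels use edge replication (an assumption of this sketch).
    """
    h, w, b = image.shape
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    shifted = [
        padded[1 + dr : 1 + dr + h, 1 + dc : 1 + dc + w, :]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]
    # Each shifted view is (H, W, B); concatenating gives (H, W, 9 * B).
    return np.concatenate(shifted, axis=2).reshape(h * w, 9 * b)

# Synthetic stand-in for the 82 x 100 scene with four spectral bands.
rng = np.random.default_rng(0)
scene = rng.random((82, 100, 4))
features = neighbor_features(scene)   # one row of 36 features per pixel
```

Any standard k-nearest-neighbor routine could then be applied to the rows of `features` with k = 5.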


13.3.3 Invariant Metrics and Tangent Distance 

In some problems, the training features are invariant under certain natural 
transformations. The nearest-neighbor classifier can exploit such invariances
by incorporating them into the metric used to measure the distances
between objects. Here we give an example where this idea was used with 
great success, and the resulting classifier outperformed all others at the 
time of its development (Simard et al., 1993). 

The problem is handwritten digit recognition, as discussed in Chapter 1
and Section 11.7. The inputs are grayscale images with 16 × 16 = 256
pixels; some examples are shown in Figure 13.9. At the top of Figure 13.10,
a “3” is shown, in its actual orientation (middle) and rotated 7.5° and 15°
in either direction. Such rotations can often occur in real handwriting, and
it is obvious to our eye that this “3” is still a “3” after small rotations.
Hence we want our nearest-neighbor classifier to consider these two “3”s
to be close together (similar). However the 256 grayscale pixel values for a
rotated “3” will look quite different from those in the original image, and
hence the two objects can be far apart in Euclidean distance in $\mathbb{R}^{256}$.

We wish to remove the effect of rotation in measuring distances between 
two digits of the same class. Consider the set of pixel values consisting of 
the original “3” and its rotated versions. This is a one-dimensional curve in
$\mathbb{R}^{256}$, depicted by the green curve passing through the “3” in Figure 13.10.
Figure 13.11 shows a stylized version of $\mathbb{R}^{256}$, with two images indicated by
$x_i$ and $x_{i'}$. These might be two different “3”s, for example. Through each
image we have drawn the curve of rotated versions of that image, called
FIGURE 13.8. Test-error performance for a number of classifiers, as reported
by the STATLOG project. The entry DANN is a variant of k-nearest neighbors,
using an adaptive metric (Section 13.4.2).
FIGURE 13.9. Examples of grayscale images of handwritten digits.

[Figure 13.10 panels: the “3” rotated by −15°, −7.5°, 0°, 7.5° and 15° (top row); images on the tangent line for α = −0.2, −0.1, 0, 0.1, 0.2 (bottom row), labeled “Linear equation for images above”.]



FIGURE 13.10. The top row shows a “3” in its original orientation (middle)
and rotated versions of it. The green curve in the middle of the figure depicts
this set of rotated “3”s in 256-dimensional space. The red line is the tangent line
to the curve at the original image, with some “3”s on this tangent line, and its
equation shown at the bottom of the figure.


invariance manifolds in this context. Now, rather than using the usual 
Euclidean distance between the two images, we use the shortest distance 
between the two curves. In other words, the distance between the two 
images is taken to be the shortest Euclidean distance between any rotated 
version of the first image, and any rotated version of the second image. This
distance is called an invariant metric. 

In principle one could carry out 1-nearest-neighbor classification using 
this invariant metric. However there are two problems with it. First, it is 
very difficult to calculate for real images. Second, it allows large
transformations that can lead to poor performance. For example a “6” would
be considered close to a “9” after a rotation of 180°. We need to restrict 
attention to small rotations. 

The use of tangent distance solves both of these problems. As shown in 
Figure 13.10, we can approximate the invariance manifold of the image 
“3” by its tangent at the original image. This tangent can be computed 
by estimating the direction vector from small rotations of the image, or by 
more sophisticated spatial smoothing methods (Exercise 13.4). For large
rotations, the tangent image no longer looks like a “3,” so the problem 
with large transformations is alleviated. 
FIGURE 13.11. Tangent distance computation for two images $x_i$ and $x_{i'}$.
Rather than using the Euclidean distance between $x_i$ and $x_{i'}$, or the shortest
distance between the two curves, we use the shortest distance between the two
tangent lines.


The idea then is to compute the invariant tangent line for each training 
image. For a query image to be classified, we compute its invariant tangent 
line, and find the closest line to it among the lines in the training set. The 
class (digit) corresponding to this closest line is our predicted class for the 
query image. In Figure 13.11 the two tangent lines intersect, but this is only
because we have been forced to draw a two-dimensional representation of
the actual 256-dimensional situation. In $\mathbb{R}^{256}$ the probability of two such
lines intersecting is effectively zero.
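
As a rough illustration of the computation just described, the shortest distance between two tangent lines (or, more generally, tangent hyperplanes) is a small least squares problem. The sketch below assumes the tangent vectors are supplied by the caller, for example from finite differences of slightly rotated images; the function names `tangent_distance` and `classify_1nn` are hypothetical and are not taken from Simard et al. (1993).

```python
import numpy as np

def tangent_distance(x, tx, y, ty):
    """Shortest Euclidean distance between two tangent hyperplanes.

    x, y  : flattened images, shape (d,).
    tx, ty: tangent vectors at x and y, shape (d, m); one column per
            transformation (m = 1 for rotation only).
    Minimizes ||(x + tx @ a) - (y + ty @ b)|| over a and b by least squares.
    """
    design = np.hstack([tx, -ty])                 # unknowns are (a, b)
    coef, *_ = np.linalg.lstsq(design, y - x, rcond=None)
    residual = (y - x) - design @ coef
    return float(np.linalg.norm(residual))

def classify_1nn(query, query_tangent, train_x, train_tangents, train_labels):
    """Return the label of the training image whose tangent plane is closest."""
    dists = [
        tangent_distance(query, query_tangent, xi, ti)
        for xi, ti in zip(train_x, train_tangents)
    ]
    return train_labels[int(np.argmin(dists))]
```

Because the coefficients a and b are unconstrained here, this sketch measures distance between the full tangent planes; restricting their size would be a further design choice that the text does not spell out.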

Now a simpler way to achieve this invariance would be to add into the 
training set a number of rotated versions of each training image, and then 
just use a standard nearest-neighbor classifier. This idea is called “hints” in 
Abu-Mostafa (1995), and works well when the space of invariances is small. 
So far we have presented a simplified version of the problem. In addition to 
rotation, there are six other types of transformations under which we would 
like our classifier to be invariant. There are translation (two directions), 
scaling (two directions), shear, and character thickness. Hence the curves
and tangent lines in Figures 13.10 and 13.11 are actually 7-dimensional 
manifolds and hyperplanes. It is infeasible to add transformed versions 
of each training image to capture all of these possibilities. The tangent 
manifolds provide an elegant way of capturing the invariances. 

Table 13.1 shows the test misclassification error for a problem with 7291 
training images and 2007 test digits (the U.S. Postal Services database), for 
a carefully constructed neural network, and simple 1-nearest-neighbor and 
TABLE 13.1. Test error rates for the handwritten ZIP code problem.

Method                                   Error rate
Neural-net                               0.049
1-nearest-neighbor/Euclidean distance    0.055
1-nearest-neighbor/tangent distance      0.026

tangent distance 1-nearest-neighbor rules. The tangent distance nearest- 
neighbor classifier works remarkably well, with test error rates near those 
for the human eye (this is a notoriously difficult test set). In practice, 
it turned out that nearest-neighbors are too slow for online classification 
in this application (see Section 13.5), and neural network classifiers were 
subsequently developed to mimic it. 


13.4 Adaptive Nearest-Neighbor Methods 

When nearest-neighbor classification is carried out in a high-dimensional 
feature space, the nearest neighbors of a point can be very far away, causing 
bias and degrading the performance of the rule. 

To quantify this, consider N data points uniformly distributed in the unit
cube $[-\frac{1}{2}, \frac{1}{2}]^p$. Let R be the radius of a 1-nearest-neighborhood centered at
the origin. Then

$$\mathrm{median}(R) = v_p^{-1/p}\left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}, \tag{13.7}$$

where $v_p r^p$ is the volume of the sphere of radius r in p dimensions. Figure 13.12
shows the median radius for various training sample sizes and
dimensions. We see that median radius quickly approaches 0.5, the distance
to the edge of the cube.
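
Formula (13.7) is easy to evaluate numerically. The sketch below uses the standard expression $v_p = \pi^{p/2}/\Gamma(p/2 + 1)$ for the volume of the unit ball; the function names are hypothetical, and the sample size of 1000 is chosen only for illustration.

```python
import math

def unit_ball_volume(p):
    """Volume v_p of the unit-radius ball in p dimensions."""
    return math.exp((p / 2) * math.log(math.pi) - math.lgamma(p / 2 + 1))

def median_nn_radius(n, p):
    """Median radius of the 1-nearest-neighborhood at the origin, per (13.7).

    The formula treats the ball as lying inside the cube, so it is only
    meaningful while the result is at most 0.5.
    """
    return (1 - 0.5 ** (1 / n)) ** (1 / p) / unit_ball_volume(p) ** (1 / p)

# Median radius for N = 1000 as the dimension grows.
for p in (1, 2, 5, 10):
    print(p, round(median_nn_radius(1000, p), 3))
```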

What can be done about this problem? Consider the two-class situation 
in Figure 13.13. There are two features, and a nearest-neighborhood at 
a query point is depicted by the circular region. Implicit in near-neighbor
classification is the assumption that the class probabilities are roughly
constant in the neighborhood, and hence simple averages give good estimates.
However, in this example the class probabilities vary only in the horizontal
direction. If we knew this, we would stretch the neighborhood in the vertical
direction, as shown by the tall rectangular region. This will reduce the
bias of our estimate and leave the variance the same. 

In general, this calls for adapting the metric used in nearest-neighbor 
classification, so that the resulting neighborhoods stretch out in directions 
for which the class probabilities don’t change much. In high-dimensional 
feature space, the class probabilities might change only in a low-dimensional
subspace and hence there can be considerable advantage to adapting the 
metric. 

FIGURE 13.12. Median radius of a 1-nearest-neighborhood, for uniform data 
with N observations in p dimensions. 


5-Nearest Neighborhoods 



FIGURE 13.13. The points are uniform in the cube, with the vertical line
separating class red and green. The vertical strip denotes the 5-nearest-neighbor region
using only the horizontal coordinate to find the nearest-neighbors for the target
point (solid dot). The sphere shows the 5-nearest-neighbor region using both
coordinates, and we see in this case it has extended into the class-red region (and
is dominated by the wrong class in this instance).
Friedman (1994a) proposed a method in which rectangular neighborhoods
are found adaptively by successively carving away edges of a box
containing the training data. Here we describe the discriminant adaptive 
nearest-neighbor (DANN) rule of Hastie and Tibshirani (1996a). Earlier, 
related proposals appear in Short and Fukunaga (1981) and Myles and 
Hand (1990). 

At each query point a neighborhood of say 50 points is formed, and the 
class distribution among the points is used to decide how to deform the 
neighborhood, that is, to adapt the metric. The adapted metric is then
used in a nearest-neighbor rule at the query point. Thus at each query 
point a potentially different metric is used. 

In Figure 13.13 it is clear that the neighborhood should be stretched in
the direction orthogonal to the line joining the class centroids. This direction
also coincides with the linear discriminant boundary, and is the direction
in which the class probabilities change the least. In general this direction
of maximum change will not be orthogonal to the line joining the class
centroids (see Figure 4.9 on page 116). Assuming a local discriminant model,
the information contained in the local within- and between-class covariance
matrices is all that is needed to determine the optimal shape of the
neighborhood.

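The discriminant adaptive nearest-neighbor (DANN) metric at a query
point $x_0$ is defined by

$$D(x, x_0) = (x - x_0)^T \mathbf{\Sigma} (x - x_0), \tag{13.8}$$

where

$$\mathbf{\Sigma} = \mathbf{W}^{-1/2}\left[\mathbf{W}^{-1/2}\mathbf{B}\mathbf{W}^{-1/2} + \epsilon \mathbf{I}\right]\mathbf{W}^{-1/2} = \mathbf{W}^{-1/2}\left[\mathbf{B}^* + \epsilon \mathbf{I}\right]\mathbf{W}^{-1/2}. \tag{13.9}$$

Here W is the pooled within-class covariance matrix $\sum_{k=1}^K \pi_k \mathbf{W}_k$ and B
is the between class covariance matrix $\sum_{k=1}^K \pi_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T$, with
W and B computed using only the 50 nearest neighbors around $x_0$. After
computation of the metric, it is used in a nearest-neighbor rule at $x_0$.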

This complicated formula is actually quite simple in its operation. It first
spheres the data with respect to W, and then stretches the neighborhood
in the zero-eigenvalue directions of B* (the between-matrix for the sphered
data). This makes sense, since locally the observed class means do not differ
in these directions. The ε parameter rounds the neighborhood, from an
infinite strip to an ellipsoid, to avoid using points far away from the query
point. The value of ε = 1 seems to work well in general. Figure 13.14 shows
the resulting neighborhoods for a problem where the classes form two
concentric circles. Notice how the neighborhoods stretch out orthogonally to
the decision boundaries when both classes are present in the neighborhood.
In the pure regions with only one class, the neighborhoods remain circular;
FIGURE 13.14. Neighborhoods found by the DANN procedure, at various query
points (centers of the crosses). There are two classes in the data, with one class
surrounding the other. 50 nearest-neighbors were used to estimate the local
metrics. Shown are the resulting metrics used to form 15-nearest-neighborhoods.


in these cases the between matrix $\mathbf{B} = 0$, and the $\mathbf{\Sigma}$ in (13.8) is the identity
matrix.
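
A minimal sketch of the local metric computation in (13.8) and (13.9) follows. It is an illustration only, not the authors' implementation: the function names are hypothetical, the class proportions $\pi_k$ are taken to be the local class frequencies, and a tiny ridge is added to W purely to keep it invertible.

```python
import numpy as np

def dann_metric(X, y, x0, n_local=50, eps=1.0):
    """Local DANN metric Sigma of (13.9) at the query point x0.

    X: (n, p) training features; y: (n,) class labels.
    W and B are estimated from the n_local nearest neighbors of x0
    (Euclidean distance). The ridge on W is an assumption of this sketch.
    """
    p = X.shape[1]
    idx = np.argsort(np.sum((X - x0) ** 2, axis=1))[:n_local]
    Xn, yn = X[idx], y[idx]
    overall_mean = Xn.mean(axis=0)

    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for label in np.unique(yn):
        Xk = Xn[yn == label]
        pi_k = len(Xk) / len(Xn)                  # local class proportion
        mean_k = Xk.mean(axis=0)
        centered = Xk - mean_k
        W += pi_k * (centered.T @ centered) / len(Xk)
        diff = (mean_k - overall_mean)[:, None]
        B += pi_k * (diff @ diff.T)

    W += 1e-8 * (np.trace(W) / p + 1.0) * np.eye(p)   # ridge (assumption)
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

    B_star = W_inv_sqrt @ B @ W_inv_sqrt          # between-matrix, sphered data
    return W_inv_sqrt @ (B_star + eps * np.eye(p)) @ W_inv_sqrt

def dann_distance(x, x0, sigma):
    """Squared DANN distance D(x, x0) of (13.8)."""
    d = x - x0
    return float(d @ sigma @ d)
```

To classify a query point, one would compute `dann_metric` at that point and then pick the nearest neighbors according to `dann_distance`.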


13.4.1 Example

Here we generate two-class data in ten dimensions, analogous to the two-
dimensional example of Figure 13.14. All ten predictors in class 1 are
independent standard normal, conditioned on the squared radius being greater than
22.4 and less than 40, while the predictors in class 2 are independent
standard normal without the restriction. There are 250 observations in each
class. Hence the first class almost completely surrounds the second class in
the full ten-dimensional space.
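
Data of this kind can be generated by simple rejection sampling. The sketch below assumes the bounds 22.4 and 40 apply to the squared radius, which for ten independent standard normal predictors is a chi-squared variable with ten degrees of freedom; the function name, batch size and seed are arbitrary choices made for illustration.

```python
import numpy as np

def simulate_nested_classes(n_per_class=250, p=10, lo=22.4, hi=40.0, seed=0):
    """Class 1: standard normal, kept only if lo < squared radius < hi.
    Class 2: standard normal with no restriction."""
    rng = np.random.default_rng(seed)
    accepted = []
    n_accepted = 0
    while n_accepted < n_per_class:
        draws = rng.standard_normal((20000, p))
        sq_radius = np.sum(draws ** 2, axis=1)
        keep = draws[(sq_radius > lo) & (sq_radius < hi)]
        accepted.append(keep)
        n_accepted += len(keep)
    class1 = np.vstack(accepted)[:n_per_class]
    class2 = rng.standard_normal((n_per_class, p))
    X = np.vstack([class1, class2])
    y = np.repeat([1, 2], n_per_class)
    return X, y

X, y = simulate_nested_classes()
```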

In this example there are no pure noise variables, the kind that a nearest- 
neighbor subset selection rule might be able to weed out. At any given 
point in the feature space, the class discrimination occurs along only one 
direction. However, this direction changes as we move across the feature 
space and all variables are important somewhere in the space. 

Figure 13.15 shows boxplots of the test error rates over ten
realizations, for standard 5-nearest-neighbors, LVQ, and discriminant adaptive
5-nearest-neighbors. We used 50 prototypes per class for LVQ, to make
it comparable to 5 nearest-neighbors (since 250/5 = 50). The adaptive
metric significantly reduces the error rate, compared to LVQ or standard
nearest-neighbors.
FIGURE 13.15. Ten-dimensional simulated example: boxplots of the test error
rates over ten realizations, for standard 5-nearest-neighbors, LVQ with 50 centers,
and discriminant-adaptive 5-nearest-neighbors.

13.4.2 Global Dimension Reduction for Nearest-Neighbors

The discriminant-adaptive nearest-neighbor method carries out local
dimension reduction, that is, dimension reduction separately at each query
point. In many problems we can also benefit from global dimension
reduction, that is, apply a nearest-neighbor rule in some optimally chosen
subspace of the original feature space. For example, suppose that the two
classes form two nested spheres in four dimensions of feature space, and
there are an additional six noise features whose distribution is independent
of class. Then we would like to discover the important four-dimensional
subspace, and carry out nearest-neighbor classification in that reduced
subspace. Hastie and Tibshirani (1996a) discuss a variation of the
discriminant-adaptive nearest-neighbor method for this purpose. At each training point
$x_i$, the between-centroids sum of squares matrix $\mathbf{B}_i$ is computed, and then
these matrices are averaged over all training points:

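$$\bar{\mathbf{B}} = \frac{1}{N}\sum_{i=1}^N \mathbf{B}_i. \tag{13.10}$$

Let $e_1, e_2, \ldots, e_p$ be the eigenvectors of the matrix $\bar{\mathbf{B}}$, ordered from largest
to smallest eigenvalue $\theta_k$. Then these eigenvectors span the optimal
subspaces for global subspace reduction. The derivation is based on the fact
that the best rank-L approximation to $\bar{\mathbf{B}}$, $\bar{\mathbf{B}}_{[L]} = \sum_{\ell=1}^L \theta_\ell e_\ell e_\ell^T$, solves the
least squares problem

$$\min_{\mathrm{rank}(\mathbf{M})=L} \sum_{i=1}^N \mathrm{trace}\left[(\mathbf{B}_i - \mathbf{M})^2\right]. \tag{13.11}$$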

Since each $\mathbf{B}_i$ contains information on (a) the local discriminant subspace,
and (b) the strength of discrimination in that subspace, (13.11) can be seen
as a way of finding the best approximating subspace of dimension L to a
series of N subspaces by weighted least squares (Exercise 13.5).
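
The global reduction step amounts to averaging the local between-centroids matrices and keeping the leading eigenvectors of the average. The sketch below assumes those matrices have already been computed and are passed in as a list; the function names `global_subspace` and `project` are hypothetical.

```python
import numpy as np

def global_subspace(B_list, L):
    """Eigenvalues of the averaged matrix (largest first) and its
    leading L eigenvectors, returned as the columns of a (p, L) array."""
    B_bar = np.mean(B_list, axis=0)
    evals, evecs = np.linalg.eigh(B_bar)          # eigh returns ascending order
    order = np.argsort(evals)[::-1]               # largest eigenvalue first
    return evals[order], evecs[:, order[:L]]

def project(X, basis):
    """Coordinates of the rows of X in the reduced subspace."""
    return X @ basis
```

Nearest-neighbor classification would then be carried out on `project(X, basis)` rather than on the original features.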

In the four-dimensional sphere example mentioned above and examined 
in Hastie and Tibshirani (1996a), four of the eigenvalues $\theta_\ell$ turn out to be
large (having eigenvectors nearly spanning the interesting subspace), and 
the remaining six are near zero. Operationally, we project the data into 
the leading four-dimensional subspace, and then carry out nearest neighbor 
classification. In the satellite image classification example in Section 13.3.2, 
the technique labeled DANN in Figure 13.8 used 5-nearest-neighbors in a 
globally reduced subspace. There are also connections of this technique 
with the sliced inverse regression proposal of Duan and Li (1991). These 
authors use similar ideas in the regression setting, but do global rather 
than local computations. They assume and exploit spherical symmetry of 
the feature distribution to estimate interesting subspaces. 


13.5 Computational Considerations 

One drawback of nearest-neighbor rules in general is the computational 
load, both in finding the neighbors and storing the entire training set. With 
N observations and p predictors, nearest-neighbor classification requires Np 
operations to find the neighbors per query point. There are fast algorithms 
for finding nearest-neighbors (Friedman et al., 1975; Friedman et al., 1977) 
which can reduce this load somewhat. Hastie and Simard (1998) reduce 
the computations for tangent distance by developing analogs of K-means
clustering in the context of this invariant metric. 

Reducing the storage requirements is more difficult, and various editing 
and condensing procedures have been proposed. The idea is to isolate a 
subset of the training set that suffices for nearest-neighbor predictions, and 
throw away the remaining training data. Intuitively, it seems important to 
keep the training points that are near the decision boundaries and on the 
correct side of those boundaries, while some points far from the boundaries 
could be discarded. 

The multi-edit algorithm of Devijver and Kittler (1982) divides the data 
cyclically into training and test sets, computing a nearest neighbor rule on 
the training set and deleting test points that are misclassified. The idea is 
to keep homogeneous clusters of training observations. 

The condensing procedure of Hart (1968) goes further, trying to keep
only important exterior points of these clusters. Starting with a single
randomly chosen observation as the training set, each additional data item is
processed one at a time, adding it to the training set only if it is
misclassified by a nearest-neighbor rule computed on the current training set.
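
The condensing step described above can be sketched as a single pass over the data with a 1-nearest-neighbor rule. This is an illustrative reading of the description, not a reproduction of Hart's published algorithm; the function name `condense` and the random processing order are assumptions.

```python
import numpy as np

def condense(X, y, seed=0):
    """One pass of condensing with a 1-nearest-neighbor rule.

    X: (n, p) features; y: (n,) labels.
    Returns the indices of the retained training points.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    kept = [int(order[0])]                        # single random starting point
    for i in order[1:]:
        dists = np.sum((X[kept] - X[i]) ** 2, axis=1)
        nearest = kept[int(np.argmin(dists))]
        if y[nearest] != y[i]:                    # misclassified by current set
            kept.append(int(i))
    return np.array(kept)
```

For two well-separated clusters the retained set contains one point per class, since every later point is already classified correctly.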

These procedures are surveyed in Dasarathy (1991) and Ripley (1996). 
They can also be applied to other learning procedures besides nearest-
neighbors. While such methods are sometimes useful, we have not had
much practical experience with them, nor have we found any systematic 
comparison of their performance in the literature. 


Bibliographic Notes 

The nearest-neighbor method goes back at least to Fix and Hodges (1951).
The extensive literature on the topic is reviewed by Dasarathy (1991);
Chapter 6 of Ripley (1996) contains a good summary. K-means
clustering is due to Lloyd (1957) and MacQueen (1967). Kohonen (1989)
introduced learning vector quantization. The tangent distance method is due to
Simard et al. (1993). Hastie and Tibshirani (1996a) proposed the
discriminant adaptive nearest-neighbor technique.


Exercises 

Ex. 13.1 Consider a Gaussian mixture model where the covariance matrices
are assumed to be scalar: $\mathbf{\Sigma}_r = \sigma \mathbf{I}\ \forall r = 1, \ldots, R$, and σ is a fixed
parameter. Discuss the analogy between the K-means clustering algorithm and
the EM algorithm for fitting this mixture model in detail. Show that in the
limit σ → 0 the two methods coincide.

Ex. 13.2 Derive formula (13.7) for the median radius of the 1-nearest- 
neighborhood. 

Ex. 13.3 Let $E^*$ be the error rate of the Bayes rule in a K-class problem,
where the true class probabilities are given by $p_k(x)$, $k = 1, \ldots, K$.
Assuming the test point and training point have identical features x, prove

$$\sum_{k=1}^K p_k(x)\left(1 - p_k(x)\right) \le 2\left(1 - p_{k^*}(x)\right) - \frac{K}{K-1}\left(1 - p_{k^*}(x)\right)^2, \tag{13.5}$$

where $k^* = \arg\max_k p_k(x)$. Hence argue that the error rate of the
1-nearest-neighbor rule converges in $L_1$, as the size of the training set
increases, to a value $E_1$, bounded above by

$$E^*\left(2 - E^*\frac{K}{K-1}\right). \tag{13.12}$$

[This statement of the theorem of Cover and Hart (1967) is taken from
Chapter 6 of Ripley (1996), where a short proof is also given.]
Ex. 13.4 Consider an image to be a function $F(x): \mathbb{R}^2 \mapsto \mathbb{R}^1$ over the two-
dimensional spatial domain (paper coordinates). Then $F(c + x_0 + A(x - x_0))$
represents an affine transformation of the image F, where A is a 2 × 2
matrix.

1. Decompose A (via Q-R) in such a way that parameters identifying 
the four affine transformations (two scale, shear and rotation) are 
clearly identified. 

2. Using the chain rule, show that the derivative of $F(c + x_0 + A(x - x_0))$
w.r.t. each of these parameters can be represented in terms of the two 
spatial derivatives of F. 

3. Using a two-dimensional kernel smoother (Chapter 6), describe how 
to implement this procedure when the images are quantized to 16 x 16 
pixels. 

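Ex. 13.5 Let $\mathbf{B}_i$, $i = 1, 2, \ldots, N$ be square p × p positive semi-definite
matrices and let $\bar{\mathbf{B}} = (1/N)\sum_i \mathbf{B}_i$. Write the eigen-decomposition of $\bar{\mathbf{B}}$ as
$\sum_{\ell=1}^p \theta_\ell e_\ell e_\ell^T$ with $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_p$. Show that the best rank-L
approximation for the $\mathbf{B}_i$,

$$\min_{\mathrm{rank}(\mathbf{M})=L} \sum_{i=1}^N \mathrm{trace}\left[(\mathbf{B}_i - \mathbf{M})^2\right],$$

is given by $\bar{\mathbf{B}}_{[L]} = \sum_{\ell=1}^L \theta_\ell e_\ell e_\ell^T$. (Hint: Write $\sum_{i=1}^N \mathrm{trace}[(\mathbf{B}_i - \mathbf{M})^2]$ as

$$\sum_{i=1}^N \mathrm{trace}\left[(\mathbf{B}_i - \bar{\mathbf{B}})^2\right] + \sum_{i=1}^N \mathrm{trace}\left[(\mathbf{M} - \bar{\mathbf{B}})^2\right].)$$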

Ex. 13.6 Here we consider the problem of shape averaging. In particular,
$\mathbf{L}_j$, $j = 1, \ldots, M$ are each N × 2 matrices of points in $\mathbb{R}^2$, each sampled
from corresponding positions of handwritten (cursive) letters. We seek an
affine invariant average $\mathbf{V}$, also N × 2, $\mathbf{V}^T\mathbf{V} = \mathbf{I}$, of the M letters with
the following property: $\mathbf{V}$ minimizes

$$\sum_{j=1}^M \min_{\mathbf{A}_j} \left\|\mathbf{L}_j - \mathbf{V}\mathbf{A}_j\right\|^2.$$

Characterize the solution. 

This solution can suffer if some of the letters are big and dominate the 
average. An alternative approach is to minimize instead: 

$$\sum_{j=1}^M \min_{\mathbf{A}_j} \left\|\mathbf{L}_j\mathbf{A}_j - \mathbf{V}\right\|^2.$$

Derive the solution to this problem. How do the criteria differ? Use the 
SVD of the Lj to simplify the comparison of the two approaches. 
Ex. 13.7 Consider the application of nearest-neighbors to the “easy” and
“hard” problems in the left panel of Figure 13.5. 

1. Replicate the results in the left panel of Figure 13.5. 

2. Estimate the misclassification errors using fivefold cross-validation, 
and compare the error rate curves to those in 1. 

3. Consider an “AIC-like” penalization of the training set misclassification
error. Specifically, add $2t/N$ to the training set misclassification
error, where t is the approximate number of parameters $N/r$, r
being the number of nearest-neighbors. Compare plots of the resulting
penalized misclassification error to those in 1 and 2. Which method
gives a better estimate of the optimal number of nearest-neighbors:
cross-validation or AIC?

Ex. 13.8 Generate data in two classes, with two features. These features
are all independent Gaussian variates with standard deviation 1. Their
mean vectors are (−1, −1) in class 1 and (1, 1) in class 2. To each feature
vector apply a random rotation of angle θ, θ chosen uniformly from 0 to
2π. Generate 50 observations from each class to form the training set, and
500 in each class as the test set. Apply four different classifiers:

1. Nearest-neighbors. 

2. Nearest-neighbors with hints: ten randomly rotated versions of each 
data point are added to the training set before applying nearest- 
neighbors. 

3. Invariant metric nearest-neighbors, using Euclidean distance invariant
to rotations about the origin.

4. Tangent distance nearest-neighbors. 

In each case choose the number of neighbors by tenfold cross-validation. 
Compare the results. 
Unsupervised Learning




14.1 Introduction 

The previous chapters have been concerned with predicting the values
of one or more outputs or response variables $Y = (Y_1, \ldots, Y_m)$ for a
given set of input or predictor variables $X^T = (X_1, \ldots, X_p)$. Denote by
$x_i^T = (x_{i1}, \ldots, x_{ip})$ the inputs for the ith training case, and let $y_i$ be a
response measurement. The predictions are based on the training sample
$(x_1, y_1), \ldots, (x_N, y_N)$ of previously solved cases, where the joint values of
all of the variables are known. This is called supervised learning or “learning
with a teacher.” Under this metaphor the “student” presents an answer
$\hat{y}_i$ for each $x_i$ in the training sample, and the supervisor or “teacher”
provides either the correct answer and/or an error associated with the
student’s answer. This is usually characterized by some loss function $L(y, \hat{y})$,
for example, $L(y, \hat{y}) = (y - \hat{y})^2$.

If one supposes that (X, Y) are random variables represented by some
joint probability density Pr(X, Y), then supervised learning can be formally
characterized as a density estimation problem where one is concerned with
determining properties of the conditional density Pr(Y|X). Usually the
properties of interest are the “location” parameters μ that minimize the
expected error at each x,

$$\mu(x) = \underset{\theta}{\mathrm{argmin}}\; E_{Y|X} L(Y, \theta). \tag{14.1}$$
Conditioning one has

$$\Pr(X, Y) = \Pr(Y|X) \cdot \Pr(X),$$

where Pr(X) is the joint marginal density of the X values alone. In
supervised learning Pr(X) is typically of no direct concern. One is interested
mainly in the properties of the conditional density Pr(Y|X). Since Y is
often of low dimension (usually one), and only its location is of interest,
the problem is greatly simplified. As discussed in the previous chapters,
there are many approaches for successfully addressing supervised learning
in a variety of contexts.

In this chapter we address unsupervised learning or “learning without a
teacher.” In this case one has a set of N observations $(x_1, x_2, \ldots, x_N)$ of a
random p-vector X having joint density Pr(X). The goal is to directly infer
the properties of this probability density without the help of a supervisor or
teacher providing correct answers or degree-of-error for each observation.
The dimension of X is sometimes much higher than in supervised learning,
and the properties of interest are often more complicated than simple
location estimates. These factors are somewhat mitigated by the fact that 
X represents all of the variables under consideration; one is not required 
to infer how the properties of Pr(X) change, conditioned on the changing 
values of another set of variables. 

In low-dimensional problems (say $p \le 3$), there are a variety of effective
nonparametric methods for directly estimating the density Pr(X) itself at
all X-values, and representing it graphically (e.g., Silverman, 1986). Owing
to the curse of dimensionality, these methods fail in high dimensions. One 
must settle for estimating rather crude global models, such as Gaussian 
mixtures or various simple descriptive statistics that characterize Pr(X). 

Generally, these descriptive statistics attempt to characterize X-values, 
or collections of such values, where Pr(X) is relatively large. Principal 
components, multidimensional scaling, self-organizing maps, and principal 
curves, for example, attempt to identify low-dimensional manifolds within 
the X-space that represent high data density. This provides information 
about the associations among the variables and whether or not they can be 
considered as functions of a smaller set of “latent” variables. Cluster analysis
attempts to find multiple convex regions of the X-space that contain
modes of Pr(X). This can tell whether or not Pr(X) can be represented by
a mixture of simpler densities representing distinct types or classes of
observations. Mixture modeling has a similar goal. Association rules attempt
to construct simple descriptions (conjunctive rules) that describe regions 
of high density in the special case of very high dimensional binary-valued 
data. 

With supervised learning there is a clear measure of success, or lack 
thereof, that can be used to judge adequacy in particular situations and 
to compare the effectiveness of different methods over various situations. 
Lack of success is directly measured by expected loss over the joint
distribution Pr(X, Y). This can be estimated in a variety of ways including
cross-validation. In the context of unsupervised learning, there is no such 
direct measure of success. It is difficult to ascertain the validity of inferences 
drawn from the output of most unsupervised learning algorithms. One must 
resort to heuristic arguments not only for motivating the algorithms, as is 
often the case in supervised learning as well, but also for judgments as to 
the quality of the results. This uncomfortable situation has led to heavy 
proliferation of proposed methods, since effectiveness is a matter of opinion 
and cannot be verified directly. 

In this chapter we present those unsupervised learning techniques that 
are among the most commonly used in practice, and additionally, a few 
others that are favored by the authors. 


14.2 Association Rules 

Association rule analysis has emerged as a popular tool for mining
commercial data bases. The goal is to find joint values of the variables
$X = (X_1, X_2, \ldots, X_p)$ that appear most frequently in the data base. It is most
often applied to binary-valued data $X_j \in \{0, 1\}$, where it is referred to as
“market basket” analysis. In this context the observations are sales
transactions, such as those occurring at the checkout counter of a store. The
variables represent all of the items sold in the store. For observation i, each
variable $X_j$ is assigned one of two values; $x_{ij} = 1$ if the jth item is
purchased as part of the transaction, whereas $x_{ij} = 0$ if it was not purchased.
Those variables that frequently have joint values of one represent items that
are frequently purchased together. This information can be quite useful for
stocking shelves, cross-marketing in sales promotions, catalog design, and
consumer segmentation based on buying patterns.
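
To make the binary encoding concrete, here is a toy sketch in Python. The item names and transactions are invented for illustration; only the encoding rule ($x_{ij} = 1$ when item j appears in transaction i) comes from the text.

```python
import numpy as np

items = ["bread", "butter", "milk", "jam"]        # hypothetical item names
transactions = [                                  # hypothetical baskets
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk"},
    {"bread", "milk", "butter"},
]

# x[i, j] = 1 if item j was purchased in transaction i, else 0.
x = np.array([[int(item in t) for item in items] for t in transactions])

# co[j, k] = fraction of transactions containing both item j and item k;
# the diagonal holds the fraction of transactions containing each item.
co = (x.T @ x) / len(transactions)
```

Large off-diagonal entries of `co` point to pairs of items that are frequently purchased together.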

More generally, the basic goal of association rule analysis is to find a
collection of prototype X-values $v_1, \ldots, v_L$ for the feature vector X, such
that the probability density $\Pr(v_l)$ evaluated at each of those values is
relatively large. In this general framework, the problem can be viewed as “mode
finding” or “bump hunting.” As formulated, this problem is impossibly
difficult. A natural estimator for each $\Pr(v_l)$ is the fraction of observations
for which $X = v_l$. For problems that involve more than a small number
of variables, each of which can assume more than a small number of values,
the number of observations for which $X = v_l$ will nearly always be too
small for reliable estimation. In order to have a tractable problem, both the
goals of the analysis and the generality of the data to which it is applied
must be greatly simplified.

The first simplification modifies the goal. Instead of seeking values x
where Pr(x) is large, one seeks regions of the X-space with high probability
content relative to their size or support. Let $\mathcal{S}_j$ represent the set of all
possible values of the jth variable (its support), and let $s_j \subseteq \mathcal{S}_j$ be a subset
of these values. The modified goal can be stated as attempting to find
subsets of variable values $s_1, \ldots, s_p$ such that the probability of each of the
variables simultaneously assuming a value within its respective subset,

$$\Pr\left[\bigcap_{j=1}^p (X_j \in s_j)\right], \tag{14.2}$$

is relatively large. The intersection of subsets $\bigcap_{j=1}^p (X_j \in s_j)$ is called a
conjunctive rule. For quantitative variables the subsets $s_j$ are contiguous
intervals; for categorical variables the subsets are delineated explicitly. Note
that if the subset $s_j$ is in fact the entire set of values $s_j = \mathcal{S}_j$, as is often
the case, the variable $X_j$ is said not to appear in the rule (14.2).


14.2.1 Market Basket Analysis 


General approaches to solving (14.2) are discussed in Section 14.2.5. These
can be quite useful in many applications. However, they are not feasible
for the very large ($p \approx 10^4$, $N \approx 10^8$) commercial data bases to which
market basket analysis is often applied. Several further simplifications of
(14.2) are required. First, only two types of subsets are considered; either
$s_j$ consists of a single value of $X_j$, $s_j = v_{0j}$, or it consists of the entire set
of values that $X_j$ can assume, $s_j = \mathcal{S}_j$. This simplifies the problem (14.2)
to finding subsets of the integers $\mathcal{J} \subset \{1, \ldots, p\}$, and corresponding values
$v_{0j}$, $j \in \mathcal{J}$, such that


$$\Pr\Big[\bigcap_{j \in \mathcal{J}} (X_j = v_{0j})\Big] \tag{14.3}$$


is large. Figure 14.1 illustrates this assumption. 

FIGURE 14.1. Simplifications for association rules. Here there are two inputs
$X_1$ and $X_2$, taking four and six distinct values, respectively. The red squares
indicate areas of high density. To simplify the computations, we assume that the
derived subset corresponds to either a single value of an input or all values. With
this assumption we could find either the middle or right pattern, but not the left
one.

One can apply the technique of dummy variables to turn (14.3) into
a problem involving only binary-valued variables. Here we assume that
the support $\mathcal{S}_j$ is finite for each variable $X_j$. Specifically, a new set of
variables $Z_1, \ldots, Z_K$ is created, one such variable for each of the values
$v_{lj}$ attainable by each of the original variables $X_1, \ldots, X_p$. The number of
dummy variables $K$ is

$$K = \sum_{j=1}^{p} |\mathcal{S}_j|,$$

where $|\mathcal{S}_j|$ is the number of distinct values attainable by $X_j$. Each dummy
variable is assigned the value $Z_k = 1$ if the variable with which it is
associated takes on the corresponding value to which $Z_k$ is assigned, and
$Z_k = 0$ otherwise. This transforms (14.3) to finding a subset of the integers
$\mathcal{K} \subset \{1, \ldots, K\}$ such that


$$\Pr\Big[\bigcap_{k \in \mathcal{K}} (Z_k = 1)\Big] = \Pr\Big[\prod_{k \in \mathcal{K}} Z_k = 1\Big] \tag{14.4}$$


is large. This is the standard formulation of the market basket problem.
The set $\mathcal{K}$ is called an “item set.” The number of variables $Z_k$ in the item
set is called its “size” (note that the size is no bigger than $p$). The estimated
value of (14.4) is taken to be the fraction of observations in the data base
for which the conjunction in (14.4) is true:


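$$\widehat{\Pr}\Big[\prod_{k \in \mathcal{K}} (Z_k = 1)\Big] = \frac{1}{N} \sum_{i=1}^{N} \prod_{k \in \mathcal{K}} z_{ik}. \tag{14.5}$$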


Here $z_{ik}$ is the value of $Z_k$ for this $i$th case. This is called the “support” or
“prevalence” $T(\mathcal{K})$ of the item set $\mathcal{K}$. An observation $i$ for which
$\prod_{k \in \mathcal{K}} z_{ik} = 1$ is said to “contain” the item set $\mathcal{K}$.

In association rule mining a lower support bound $t$ is specified, and one
seeks all item sets $\mathcal{K}_l$ that can be formed from the variables $Z_1, \ldots, Z_K$
with support in the data base greater than this lower bound $t$:

$$\{\mathcal{K}_l \mid T(\mathcal{K}_l) > t\}. \tag{14.6}$$
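To make the definitions concrete, the following is a minimal Python sketch of the support estimate (14.5) and of the target set (14.6), computed by exhaustive enumeration. The binary matrix `Z` is made-up toy data, and the function names are illustrative only. Enumerating all $2^K$ item sets is feasible only for very small $K$.

```python
from itertools import combinations

# Hypothetical toy data: each row is one basket, each column a dummy variable Z_k.
Z = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
]

def support(Z, item_set):
    """Fraction of rows in which every Z_k with k in item_set equals 1 (eq. 14.5)."""
    hits = sum(all(row[k] == 1 for k in item_set) for row in Z)
    return hits / len(Z)

def frequent_item_sets_brute_force(Z, t):
    """All non-empty item sets with support > t, found by trying every subset (eq. 14.6)."""
    K = len(Z[0])
    result = {}
    for size in range(1, K + 1):
        for item_set in combinations(range(K), size):
            s = support(Z, item_set)
            if s > t:
                result[item_set] = s
    return result

print(support(Z, (0, 1)))                      # support of the item set {Z_1, Z_2}
print(frequent_item_sets_brute_force(Z, 0.5))  # item sets with support above t = 0.5
```

Columns are indexed from zero in the code, so `(0, 1)` stands for the item set containing the first two dummy variables.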


14.2.2 The Apriori Algorithm

The solution to this problem (14.6) can be obtained with feasible computation
for very large data bases provided the threshold $t$ is adjusted so that
(14.6) consists of only a small fraction of all $2^K$ possible item sets. The
“Apriori” algorithm (Agrawal et al., 1995) exploits several aspects of the
curse of dimensionality to solve (14.6) with a small number of passes over
the data. Specifically, for a given support threshold $t$:

• The cardinality $|\{\mathcal{K} \mid T(\mathcal{K}) > t\}|$ is relatively small.


• Any item set $\mathcal{L}$ consisting of a subset of the items in $\mathcal{K}$ must have
support greater than or equal to that of $\mathcal{K}$: $\mathcal{L} \subseteq \mathcal{K} \Rightarrow T(\mathcal{L}) \geq T(\mathcal{K})$.


The first pass over the data computes the support of all single-item sets. 
Those whose support is less than the threshold are discarded. The second 
pass computes the support of all item sets of size two that can be formed 
from pairs of the single items surviving the first pass. In other words, to 
generate all frequent item sets with $|\mathcal{K}| = m$, we need to consider only
candidates such that all of their $m$ ancestral item sets of size $m - 1$ are
frequent. Those size-two item sets with support less than the threshold are 
discarded. Each successive pass over the data considers only those item 
sets that can be formed by combining those that survived the previous 
pass with those retained from the first pass. Passes over the data continue 
until all candidate rules from the previous pass have support less than the 
specified threshold. The Apriori algorithm requires only one pass over the 
data for each value of $|\mathcal{K}|$, which is crucial since we assume the data cannot
be fitted into a computer’s main memory. If the data are sufficiently sparse 
(or if the threshold t is high enough), then the process will terminate in 
reasonable time even for huge data sets. 
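The pass structure described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of Agrawal et al.: it holds the baskets in memory, builds size-$m$ candidates as unions of surviving size-$(m-1)$ sets, and prunes any candidate with an infrequent subset. The function name `apriori` and the toy baskets are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, t):
    """Return {item_set: support} for all item sets with support > t.

    transactions: list of sets of items; t: support threshold.
    The data are scanned once for each item-set size m.
    """
    n = len(transactions)
    # Pass 1: support of all single-item sets.
    counts = {}
    for basket in transactions:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {k: c / n for k, c in counts.items() if c / n > t}
    frequent = dict(current)
    m = 2
    while current:
        prev = set(current)
        # Candidates of size m: unions of surviving sets ...
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                # ... all of whose subsets of size m - 1 are frequent.
                if len(union) == m and all(
                    frozenset(s) in prev for s in combinations(union, m - 1)
                ):
                    candidates.add(union)
        # One pass over the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for basket in transactions:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        current = {c: k / n for c, k in counts.items() if k / n > t}
        frequent.update(current)
        m += 1
    return frequent

# Hypothetical toy baskets.
baskets = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "peanut butter"},
    {"bread", "milk"},
    {"jelly", "peanut butter", "bread"},
    {"milk"},
]
result = apriori(baskets, t=0.3)
for item_set, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(item_set), s)
```

The pruning step is where the second property in the list above is used: a candidate is counted only if every smaller item set it contains has already survived.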

There are many additional tricks that can be used as part of this strategy
to increase speed and convergence (Agrawal et al., 1995). The Apriori
algorithm represents one of the major advances in data mining technology.

Each high support item set $\mathcal{K}$ (14.6) returned by the Apriori algorithm is
cast into a set of “association rules.” The items $Z_k$, $k \in \mathcal{K}$, are partitioned
into two disjoint subsets, $A \cup B = \mathcal{K}$, and written

$$A \Rightarrow B. \tag{14.7}$$


The first item subset $A$ is called the “antecedent” and the second $B$ the
“consequent.” Association rules are defined to have several properties based
on the prevalence of the antecedent and consequent item sets in the data
base. The “support” of the rule $T(A \Rightarrow B)$ is the fraction of observations
in the union of the antecedent and consequent, which is just the support
of the item set $\mathcal{K}$ from which they were derived. It can be viewed as an
estimate (14.5) of the probability of simultaneously observing both item
sets $\Pr(A \text{ and } B)$ in a randomly selected market basket. The “confidence”
or “predictability” $C(A \Rightarrow B)$ of the rule is its support divided by the
support of the antecedent


$$C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}, \tag{14.8}$$


which can be viewed as an estimate of $\Pr(B \mid A)$. The notation $\Pr(A)$, the
probability of an item set $A$ occurring in a basket, is an abbreviation for
$\Pr(\prod_{k \in A} Z_k = 1)$. The “expected confidence” is defined as the support of
the consequent $T(B)$, which is an estimate of the unconditional probability
$\Pr(B)$. Finally, the “lift” of the rule is defined as the confidence divided by
the expected confidence


$$L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)}.$$

This is an estimate of the association measure $\Pr(A \text{ and } B)/(\Pr(A)\Pr(B))$.

As an example, suppose the item set $\mathcal{K}$ = {peanut butter, jelly, bread}
and consider the rule {peanut butter, jelly} $\Rightarrow$ {bread}. A support value
of 0.03 for this rule means that peanut butter, jelly, and bread appeared
together in 3% of the market baskets. A confidence of 0.82 for this rule
implies that when peanut butter and jelly were purchased, 82% of the time
bread was also purchased. If bread appeared in 43% of all market baskets
then the rule {peanut butter, jelly} $\Rightarrow$ {bread} would have a lift of
$0.82/0.43 \approx 1.91$.
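The three rule measures are simple ratios of supports, as the short Python sketch below illustrates. The baskets are made-up toy data and `rule_measures` is a hypothetical helper name. With the rounded figures quoted in the text, the ratio of confidence to expected confidence, 0.82/0.43, evaluates to about 1.91.

```python
def support(baskets, items):
    """Fraction of baskets containing every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def rule_measures(baskets, antecedent, consequent):
    """Support, confidence (14.8) and lift of the rule antecedent => consequent."""
    t_rule = support(baskets, set(antecedent) | set(consequent))
    confidence = t_rule / support(baskets, antecedent)
    lift = confidence / support(baskets, consequent)
    return t_rule, confidence, lift

# Hypothetical toy data: 10 baskets.
baskets = (
    [{"peanut butter", "jelly", "bread"}] * 3
    + [{"peanut butter", "jelly"}] * 1
    + [{"bread"}] * 2
    + [{"milk"}] * 4
)
t_rule, conf, lift = rule_measures(baskets, {"peanut butter", "jelly"}, {"bread"})
print(t_rule, conf, lift)

# Lift implied by the rounded figures quoted in the text.
text_lift = 0.82 / 0.43
print(round(text_lift, 2))
```

A lift above one indicates that the consequent appears more often in baskets containing the antecedent than it does overall.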

The goal of this analysis is to produce association rules (14.7) with both
high values of support and confidence (14.8). The Apriori algorithm returns
all item sets with high support as defined by the support threshold $t$ (14.6).
A confidence threshold $c$ is set, and all rules that can be formed from those
item sets (14.6) with confidence greater than this value

$$\{A \Rightarrow B \mid C(A \Rightarrow B) > c\} \tag{14.9}$$

are reported. For each item set $\mathcal{K}$ of size $|\mathcal{K}|$ there are $2^{|\mathcal{K}|-1} - 1$ rules of
the form $A \Rightarrow (\mathcal{K} - A)$, $A \subset \mathcal{K}$. Agrawal et al. (1995) present a variant of
the Apriori algorithm that can rapidly determine which rules survive the
confidence threshold (14.9) from all possible rules that can be formed from
the solution item sets (14.6).

The output of the entire analysis is a collection of association rules (14.7)
that satisfy the constraints

$$T(A \Rightarrow B) > t \quad \text{and} \quad C(A \Rightarrow B) > c.$$

These are generally stored in a data base that can be queried by the user. 
Typical requests might be to display the rules in sorted order of confidence, 
lift or support. More specifically, one might request such a list conditioned 
on particular items in the antecedent or especially the consequent. For 
example, a request might be the following: 

Display all transactions in which ice skates are the consequent 
that have confidence over 80% and support of more than 2%. 

This could provide information on those items (antecedent) that predicate 
sales of ice skates. Focusing on a particular consequent casts the problem 
into the framework of supervised learning. 

Association rules have become a popular tool for analyzing very large
commercial data bases in settings where market basket analysis is relevant,
that is, when the data can be cast in the form of a multidimensional contingency
table. The output is in the form of conjunctive rules (14.4) that are easily
understood and interpreted. The Apriori algorithm allows this analysis to
be applied to huge data bases, much larger than are amenable to other types
of analyses. Association rules are among data mining’s biggest successes.

Besides the restrictive form of the data to which they can be applied,
association rules have other limitations. Critical to computational feasibility
is the support threshold (14.6). The number of solution item sets, their size,
and the number of passes required over the data can grow exponentially
with decreasing size of this lower bound. Thus, rules with high confidence
or lift, but low support, will not be discovered. For example, a high confidence
rule such as vodka $\Rightarrow$ caviar will not be uncovered owing to the low
sales volume of the consequent caviar.


14.2.3 Example: Market Basket Analysis

We illustrate the use of Apriori on a moderately sized demographics data
base. This data set consists of $N = 9409$ questionnaires filled out by shopping
mall customers in the San Francisco Bay Area (Impact Resources, Inc.,
Columbus OH, 1987). Here we use answers to the first 14 questions, relating
to demographics, for illustration. These questions are listed in Table
14.1. The data are seen to consist of a mixture of ordinal and (unordered)
categorical variables, many of the latter having more than a few values.
There are many missing values.

We used a freeware implementation of the Apriori algorithm due to Christian
Borgelt¹. After removing observations with missing values, each ordinal
predictor was cut at its median and coded by two dummy variables; each
categorical predictor with $k$ categories was coded by $k$ dummy variables.
This resulted in a 6876 × 50 matrix of 6876 observations on 50 dummy
variables.
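The coding step just described can be sketched as follows. This is an illustration under assumptions rather than the preprocessing actually used: the variable values are invented, the function names are hypothetical, and ties at the median are placed in the upper group.

```python
from statistics import median

def code_ordinal(values):
    """Cut an ordinal variable at its median; return two dummy columns (low, high)."""
    m = median(values)
    low = [1 if v < m else 0 for v in values]   # below the median
    high = [1 - z for z in low]                 # at or above the median
    return low, high

def code_categorical(values):
    """One dummy column per distinct category, keyed by category label."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical data for seven respondents.
income = [1, 3, 5, 7, 9, 2, 8]   # ordinal category codes
home = ["house", "apartment", "condo", "house", "house", "apartment", "condo"]

low, high = code_ordinal(income)
dummies = code_categorical(home)
print(low, high)
print(dummies)
```

Each respondent has exactly one dummy equal to one within each original variable, so a variable with $k$ categories contributes $k$ columns and an ordinal variable contributes two.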

The algorithm found a total of 6288 association rules, involving ≤ 5
predictors, with support of at least 10%. Understanding this large set of
rules is itself a challenging data analysis task. We will not attempt this here,
but only illustrate in Figure 14.2 the relative frequency of each dummy
variable in the data (top) and the association rules (bottom). Prevalent
categories tend to appear more often in the rules, for example, the first
category in language (English). However, others such as occupation are
under-represented, with the exception of the first and fifth level.

¹ See http://fuzzy.cs.uni-magdeburg.de/~borgelt.

TABLE 14.1. Inputs for the demographic data.

Feature  Demographic            # Values  Type
   1     Sex                        2     Categorical
   2     Marital status             5     Categorical
   3     Age                        7     Ordinal
   4     Education                  6     Ordinal
   5     Occupation                 9     Categorical
   6     Income                     9     Ordinal
   7     Years in Bay Area          5     Ordinal
   8     Dual incomes               3     Categorical
   9     Number in household        9     Ordinal
  10     Number of children         9     Ordinal
  11     Householder status         3     Categorical
  12     Type of home               5     Categorical
  13     Ethnic classification      8     Categorical
  14     Language in home           3     Categorical

Here are three examples of association rules found by the Apriori algorithm:

Association rule 1: Support 25%, confidence 99.7% and lift 1.03.

    [ number in household = 1
      number of children  = 0 ]
                ⇓
      language in home = English


Association rule 2: Support 13.4%, confidence 80.8%, and lift 2.13.

    [ language in home   = English
      householder status = own
      occupation         = {professional/managerial} ]
                ⇓
      income ≥ $40,000


Association rule 3: Support 26.5%, confidence 82.8% and lift 2.15.

    [ language in home   = English
      income             < $40,000
      marital status     = not married
      number of children = 0 ]
                ⇓
      education ∉ {college graduate, graduate study}

We chose the first and third rules based on their high support. The second
rule is an association rule with a high-income consequent, and could be
used to try to target high-income individuals.

As stated above, we created dummy variables for each category of the
input predictors, for example, $Z_1 = I(\text{income} < \$40{,}000)$ and
$Z_2 = I(\text{income} \geq \$40{,}000)$ for below and above the median income. If we were
interested only in finding associations with the high-income category, we
would include $Z_2$ but not $Z_1$. This is often the case in actual market basket
problems, where we are interested in finding associations with the presence
of a relatively rare item, but not associations with its absence.


14.2.4 Unsupervised as Supervised Learning


Here we discuss a technique for transforming the density estimation problem
into one of supervised function approximation. This forms the basis
for the generalized association rules described in the next section.

Let $g(x)$ be the unknown data probability density to be estimated, and
$g_0(x)$ be a specified probability density function used for reference. For
example, $g_0(x)$ might be the uniform density over the range of the variables.
Other possibilities are discussed below. The data set $x_1, x_2, \ldots, x_N$ is
presumed to be an i.i.d. random sample drawn from $g(x)$. A sample of size $N_0$
can be drawn from $g_0(x)$ using Monte Carlo methods. Pooling these two
data sets, and assigning mass $w = N_0/(N + N_0)$ to those drawn from $g(x)$,
and $w_0 = N/(N + N_0)$ to those drawn from $g_0(x)$, results in a random
sample drawn from the mixture density $(g(x) + g_0(x))/2$. If one assigns
the value $Y = 1$ to each sample point drawn from $g(x)$ and $Y = 0$ to those
drawn from $g_0(x)$, then


$$\mu(x) = E(Y \mid x) = \frac{g(x)}{g(x) + g_0(x)} = \frac{g(x)/g_0(x)}{1 + g(x)/g_0(x)} \tag{14.10}$$


can be estimated by supervised learning using the combined sample

$$(y_1, x_1), (y_2, x_2), \ldots, (y_{N+N_0}, x_{N+N_0}) \tag{14.11}$$

as training data. The resulting estimate $\hat\mu(x)$ can be inverted to provide an
estimate for $g(x)$:

$$\hat g(x) = g_0(x) \frac{\hat\mu(x)}{1 - \hat\mu(x)}. \tag{14.12}$$

FIGURE 14.3. Density estimation via classification. (Left panel:) Training set
of 200 data points. (Right panel:) Training set plus 200 reference data points,
generated uniformly over the rectangle containing the training data. The training
sample was labeled as class 1, and the reference sample class 0, and a
semiparametric logistic regression model was fit to the data. Some contours for
$\hat g(x)$ are shown.

Generalized versions of logistic regression (Section 4.4) are especially well
suited for this application since the log-odds,

$$f(x) = \log \frac{g(x)}{g_0(x)}, \tag{14.13}$$

are estimated directly. In this case one has


$$\hat g(x) = g_0(x) e^{\hat f(x)}. \tag{14.14}$$

An example is shown in Figure 14.3. We generated a training set of size
200 shown in the left panel. The right panel shows the reference data (blue)
generated uniformly over the rectangle containing the training data. The
training sample was labeled as class 1, and the reference sample class 0,
and a logistic regression model, using a tensor product of natural splines
(Section 5.2.1), was fit to the data. Some probability contours of $\hat\mu(x)$ are
shown in the right panel; these are also the contours of the density estimate
$\hat g(x)$, since with a constant $g_0(x)$, $\hat g(x) \propto \hat\mu(x)/(1 - \hat\mu(x))$ is a monotone
function of $\hat\mu(x)$. The contours roughly capture the data density.
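The recipe (14.10) to (14.14) can be tried on a one-dimensional toy problem. The sketch below is an illustration under several assumptions that are not in the text: the data are standard normal draws truncated to $[-4, 4]$, the reference density is uniform on that interval, and the log-odds are modeled as a quadratic in $x$ and fitted by plain gradient descent rather than by the spline model used for Figure 14.3.

```python
import math
import random

random.seed(0)
N = N0 = 300
a = 4.0  # half-width of the reference interval [-a, a] (an assumption)

# Data sample from g: standard normal, truncated to the reference interval.
data = []
while len(data) < N:
    x = random.gauss(0.0, 1.0)
    if abs(x) <= a:
        data.append(x)
# Reference sample from g0: uniform on [-a, a].
ref = [random.uniform(-a, a) for _ in range(N0)]

xs = data + ref
ys = [1] * N + [0] * N0   # Y = 1 for data, Y = 0 for reference

def features(x):
    u = x / a             # rescale to [-1, 1] so gradient steps are well behaved
    return (1.0, u, u * u)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression by gradient descent on the mean negative log-likelihood.
phi = [features(x) for x in xs]
beta = [0.0, 0.0, 0.0]
lr = 1.0
for _ in range(2000):
    grad = [0.0, 0.0, 0.0]
    for f, y in zip(phi, ys):
        r = sigmoid(sum(b * v for b, v in zip(beta, f))) - y
        for j in range(3):
            grad[j] += r * f[j]
    beta = [b - lr * g / len(xs) for b, g in zip(beta, grad)]

def g0(x):
    return 1.0 / (2 * a) if abs(x) <= a else 0.0

def g_hat(x):
    """Density estimate (14.14): g0(x) * exp(f_hat(x))."""
    f_hat = sum(b * v for b, v in zip(beta, features(x)))
    return g0(x) * math.exp(f_hat)

for x in (0.0, 1.0, 2.0, 3.5):
    print(x, g_hat(x))
```

Because $N_0 = N$ here, the fitted log-odds estimate $\log(g/g_0)$ directly. The fitted coefficient on the squared term should be negative, so the estimated density falls off away from zero, as it does for the density that generated the data.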

In principle any reference density can be used for $g_0(x)$ in (14.14). In
practice the accuracy of the estimate $\hat g(x)$ can depend greatly on particular
choices. Good choices will depend on the data density $g(x)$ and the
procedure used to estimate (14.10) or (14.13). If accuracy is the goal, $g_0(x)$
should be chosen so that the resulting functions $\mu(x)$ or $f(x)$ are
approximated easily by the method being used. However, accuracy is not always
the primary goal. Both $\mu(x)$ and $f(x)$ are monotonic functions of the density
ratio $g(x)/g_0(x)$. They can thus be viewed as “contrast” statistics that
provide information concerning departures of the data density $g(x)$ from
the chosen reference density $g_0(x)$. Therefore, in data analytic settings, a
choice for $g_0(x)$ is dictated by types of departures that are deemed most
interesting in the context of the specific problem at hand. For example, if
departures from uniformity are of interest, $g_0(x)$ might be a uniform
density over the range of the variables. If departures from joint normality
are of interest, a good choice for $g_0(x)$ would be a Gaussian distribution
with the same mean vector and covariance matrix as the data. Departures
from independence could be investigated by using

$$g_0(x) = \prod_{j=1}^{p} g_j(x_j), \tag{14.15}$$

where $g_j(x_j)$ is the marginal data density of $X_j$, the $j$th coordinate of $X$.
A sample from this independent density (14.15) is easily generated from the
data itself by applying a different random permutation to the data values
of each of the variables.
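The permutation device takes only a few lines. In the Python sketch below, the function name and the toy rows are hypothetical. Each column is shuffled independently, which preserves every marginal distribution exactly while breaking the associations between variables.

```python
import random

def product_of_marginals_sample(rows, rng):
    """Independently permute each column of `rows` (a list of equal-length tuples)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)   # a different random permutation for each variable
    return [tuple(vals) for vals in zip(*columns)]

# Hypothetical data with perfectly associated columns.
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
reference = product_of_marginals_sample(rows, random.Random(0))
print(reference)
```

The resulting reference sample has the same size as the data, so $N_0 = N$.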

As discussed above, unsupervised learning is concerned with revealing
properties of the data density $g(x)$. Each technique focuses on a particular
property or set of properties. Although this approach of transforming
the problem to one of supervised learning (14.10)-(14.14) seems to have
been part of the statistics folklore for some time, it does not appear to
have had much impact despite its potential to bring well-developed
supervised learning methodology to bear on unsupervised learning problems.
One reason may be that the problem must be enlarged with a simulated
data set generated by Monte Carlo techniques. Since the size of this data
set should be at least as large as the data sample, $N_0 \geq N$, the computation
and memory requirements of the estimation procedure are at least
doubled. Also, substantial computation may be required to generate the
Monte Carlo sample itself. Although perhaps a deterrent in the past, these
increased computational requirements are becoming much less of a burden
as increased resources become routinely available. We illustrate the use of
supervised learning methods for unsupervised learning in the next section.


14.2.5 Generalized Association Rules

The more general problem (14.2) of finding high-density regions in the data
space can be addressed using the supervised learning approach described
above. Although not applicable to the huge data bases for which market
basket analysis is feasible, useful information can be obtained from
moderately sized data sets. The problem (14.2) can be formulated as finding
subsets of the integers $\mathcal{J} \subset \{1, 2, \ldots, p\}$ and corresponding value subsets
$s_j$, $j \in \mathcal{J}$ for the corresponding variables $X_j$, such that


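$$\widehat{\Pr}\Big(\bigcap_{j \in \mathcal{J}} (X_j \in s_j)\Big) = \frac{1}{N} \sum_{i=1}^{N} I\Big(\bigcap_{j \in \mathcal{J}} (x_{ij} \in s_j)\Big) \tag{14.16}$$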


is large. Following the nomenclature of association rule analysis,
$\{(X_j \in s_j)\}_{j \in \mathcal{J}}$ will be called a “generalized” item set. The subsets $s_j$
corresponding to quantitative variables are taken to be contiguous intervals within
their range of values, and subsets for categorical variables can involve more
than a single value. The ambitious nature of this formulation precludes a
thorough search for all generalized item sets with support (14.16) greater
than a specified minimum threshold, as was possible in the more restrictive
setting of market basket analysis. Heuristic search methods must be
employed, and the most one can hope for is to find a useful collection of
such generalized item sets.
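Evaluating the support (14.16) of a given generalized item set is straightforward, even though searching for good ones is not. The Python sketch below represents each subset $s_j$ as a predicate. The helper names `interval` and `one_of` and the four toy records are assumptions made for illustration.

```python
def generalized_support(rows, item_set):
    """Fraction of rows satisfying every condition (X_j in s_j) in item_set (eq. 14.16).

    rows: list of dicts mapping variable name -> value.
    item_set: dict mapping variable name -> predicate describing the subset s_j.
    """
    hits = sum(all(pred(row[name]) for name, pred in item_set.items()) for row in rows)
    return hits / len(rows)

def interval(lo, hi):
    """Contiguous interval [lo, hi] for a quantitative variable."""
    return lambda v: lo <= v <= hi

def one_of(*levels):
    """Explicit subset of levels for a categorical variable."""
    allowed = set(levels)
    return lambda v: v in allowed

# Hypothetical records.
rows = [
    {"age": 22, "householder": "rent"},
    {"age": 35, "householder": "own"},
    {"age": 19, "householder": "live with family"},
    {"age": 24, "householder": "own"},
]
item_set = {"age": interval(0, 24), "householder": one_of("rent", "live with family")}
s = generalized_support(rows, item_set)
print(s)
```

Variables that do not appear in `item_set` are unconstrained, which matches the convention that $s_j = \mathcal{S}_j$ for variables not in the rule.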

Both market basket analysis (14.5) and the generalized formulation (14.16)
implicitly reference the uniform probability distribution. One seeks item
sets that are more frequent than would be expected if all joint data values
$(x_1, x_2, \ldots, x_N)$ were uniformly distributed. This favors the discovery of
item sets whose marginal constituents $(X_j \in s_j)$ are individually frequent,
that is, the quantity

$$\frac{1}{N} \sum_{i=1}^{N} I(x_{ij} \in s_j) \tag{14.17}$$

is large. Conjunctions of frequent subsets (14.17) will tend to appear more
often among item sets of high support (14.16) than conjunctions of marginally
less frequent subsets. This is why the rule vodka $\Rightarrow$ caviar is not likely
to be discovered in spite of a high association (lift); neither item has high
marginal support, so that their joint support is especially small. Reference
to the uniform distribution can cause highly frequent item sets with low
associations among their constituents to dominate the collection of highest
support item sets.

Highly frequent subsets $s_j$ are formed as disjunctions of the most frequent
$X_j$-values. Using the product of the variable marginal data densities
(14.15) as a reference distribution removes the preference for highly frequent
values of the individual variables in the discovered item sets. This is
because the density ratio $g(x)/g_0(x)$ is uniform if there are no associations
among the variables (complete independence), regardless of the frequency
distribution of the individual variable values. Rules like vodka $\Rightarrow$ caviar
would have a chance to emerge. It is not clear, however, how to incorporate
reference distributions other than the uniform into the Apriori algorithm.
As explained in Section 14.2.4, it is straightforward to generate a sample
from the product density (14.15), given the original data set.

After choosing a reference distribution, and drawing a sample from it
as in (14.11), one has a supervised learning problem with a binary-valued
output variable $Y \in \{0, 1\}$. The goal is to use this training data to find
regions

$$R = \bigcap_{j \in \mathcal{J}} (X_j \in s_j) \tag{14.18}$$

for which the target function $\mu(x) = E(Y \mid x)$ is relatively large. In addition,
one might wish to require that the data support of these regions

$$T(R) = \int_{x \in R} g(x)\,dx \tag{14.19}$$


not be too small. 

14.2.6 Choice of Supervised Learning Method

The regions (14.18) are defined by conjunctive rules. Hence supervised
methods that learn such rules would be most appropriate in this context.
The terminal nodes of a CART decision tree are defined by rules precisely
of the form (14.18). Applying CART to the pooled data (14.11) will produce
a decision tree that attempts to model the target (14.10) over the
entire data space by a disjoint set of regions (terminal nodes). Each region
is defined by a rule of the form (14.18). Those terminal nodes $t$ with high
average $y$-values

$$\bar y_t = \mathrm{ave}(y_i \mid x_i \in t)$$

are candidates for high-support generalized item sets (14.16). The actual
(data) support is given by

$$T(R) = \bar y_t \cdot \frac{N_t}{N + N_0},$$

where $N_t$ is the number of (pooled) observations within the region represented
by the terminal node. By examining the resulting decision tree, one
might discover interesting generalized item sets of relatively high support.
These can then be partitioned into antecedents and consequents in a search 
for generalized association rules of high confidence and/or lift. 

Another natural learning method for this purpose is the patient rule 
induction method PRIM described in Section 9.3. PRIM also produces 
rules precisely of the form (14.18), but it is especially designed for finding 
high-support regions that maximize the average target (14.10) value within 
them, rather than trying to model the target function over the entire data 
space. It also provides more control over the support/average-target-value 
tradeoff. 

Exercise 14.3 addresses an issue that arises with either of these methods
when we generate random data from the product of the marginal distributions.


14.2.7 Example: Market Basket Analysis (Continued)

We illustrate the use of PRIM on the demographics data of Table 14.1. 

Three of the high-support generalized item sets emerging from the PRIM 
analysis were the following: 

Item set 1: Support = 24%.

    marital status     = married
    householder status = own
    type of home       ≠ apartment

Item set 2: Support = 24%.

    age                ≤ 24
    marital status     ∈ {living together-not married, single}
    occupation         ∉ {professional, homemaker, retired}
    householder status ∈ {rent, live with family}

Item set 3: Support = 15%.

    householder status  = rent
    type of home        ≠ house
    number in household ≤ 2
    number of children  = 0
    occupation          ∉ {homemaker, student, unemployed}
    income              ∈ [$20,000, $150,000]

Generalized association rules derived from these item sets with confidence 
(14.8) greater than 95% are the following: 

Association rule 1: Support 25%, confidence 99.7% and lift 1.35.

    [ marital status     = married
      householder status = own ]
                ⇓
      type of home ≠ apartment

Association rule 2: Support 25%, confidence 98.7% and lift 1.97.

    [ age                ≤ 24
      occupation         ∉ {professional, homemaker, retired}
      householder status ∈ {rent, live with family} ]
                ⇓
      marital status ∈ {single, living together-not married}

Association rule 3: Support 25%, confidence 95.9% and lift 2.61.

    [ householder status = own
      type of home       ≠ apartment ]
                ⇓
      marital status = married

Association rule 4: Support 15%, confidence 95.4% and lift 1.50.

    [ householder status  = rent
      type of home        ≠ house
      number in household ≤ 2
      occupation          ∉ {homemaker, student, unemployed}
      income              ∈ [$20,000, $150,000] ]
                ⇓
      number of children = 0


There are no great surprises among these particular rules. For the most
part they verify intuition. In other contexts where there is less prior
information available, unexpected results have a greater chance to emerge.
These results do illustrate the type of information generalized association
rules can provide, and that the supervised learning approach, coupled with
a rule induction method such as CART or PRIM, can uncover item sets
exhibiting high associations among their constituents.

How do these generalized association rules compare to those found earlier 
by the Apriori algorithm? Since the Apriori procedure gives thousands of 
rules, it is difficult to compare them. However some general points can be 
made. The Apriori algorithm is exhaustive—it finds all rules with support 
greater than a specified amount. In contrast, PRIM is a greedy algorithm 
and is not guaranteed to give an “optimal” set of rules. On the other hand, 
the Apriori algorithm can deal only with dummy variables and hence could 
not find some of the above rules. For example, since type of home is a 
categorical input, with a dummy variable for each level, Apriori could not 
find a rule involving the set 

type of home ≠ apartment.

To find this set, we would have to code a dummy variable for apartment 
versus the other categories of type of home. It will not generally be feasible 
to precode all such potentially interesting comparisons. 


14.3 Cluster Analysis 

Cluster analysis, also called data segmentation, has a variety of goals. All 
relate to grouping or segmenting a collection of objects into subsets or 
“clusters,” such that those within each cluster are more closely related to 
one another than objects assigned to different clusters. An object can be 
described by a set of measurements, or by its relation to other objects. 
In addition, the goal is sometimes to arrange the clusters into a natural 
hierarchy. This involves successively grouping the clusters themselves so
that at each level of the hierarchy, clusters within the same group are more
similar to each other than those in different groups.

FIGURE 14.4. Simulated data in the plane, clustered into three classes
(represented by orange, blue and green) by the K-means clustering algorithm.

Cluster analysis is also used to form descriptive statistics to ascertain
whether or not the data consists of a set of distinct subgroups, each group
representing objects with substantially different properties. This latter goal
requires an assessment of the degree of difference between the objects
assigned to the respective clusters.

Central to all of the goals of cluster analysis is the notion of the degree of 
similarity (or dissimilarity) between the individual objects being clustered. 
A clustering method attempts to group the objects based on the definition 
of similarity supplied to it. This can only come from subject matter
considerations. The situation is somewhat similar to the specification of a loss or
cost function in prediction problems (supervised learning). There the cost 
associated with an inaccurate prediction depends on considerations outside 
the data. 

Figure 14.4 shows some simulated data clustered into three groups via
the popular K-means algorithm. In this case two of the clusters are not
well separated, so that “segmentation” more accurately describes the part
of this process than “clustering.” K-means clustering starts with guesses
for the three cluster centers. Then it alternates the following steps until
convergence:

• for each data point, the closest cluster center (in Euclidean distance)
is identified;

• each cluster center is replaced by the coordinate-wise average of all
data points that are closest to it.
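The two alternating steps can be written out directly. The following Python sketch is a minimal illustration with made-up points and starting centers. Leaving an empty cluster's center unchanged is a choice made here and is not prescribed by the text.

```python
def kmeans(points, centers, max_iter=100):
    """Alternate the assignment and averaging steps until assignments stop changing."""
    centers = [tuple(c) for c in centers]   # work on a copy of the initial guesses
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to its closest center (squared Euclidean distance).
        new_assignment = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(x, centers[k])))
            for x in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: replace each center by the coordinate-wise mean of its points.
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:   # keep the old center if the cluster is empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups in the plane.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print(assignment)
```

On this toy input the loop stops after the second assignment step, because the assignments no longer change.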

We describe K-means clustering in more detail later, including the problem
of how to choose the number of clusters (three in this example).
K-means clustering is a top-down procedure, while other cluster approaches
that we discuss are bottom-up. Fundamental to all clustering techniques is
the choice of distance or dissimilarity measure between two objects. We
first discuss distance measures before describing a variety of algorithms for
clustering.

14.3.1 Proximity Matrices

Sometimes the data is represented directly in terms of the proximity (alikeness
or affinity) between pairs of objects. These can be either similarities or
dissimilarities (difference or lack of affinity). For example, in social science
experiments, participants are asked to judge by how much certain objects
differ from one another. Dissimilarities can then be computed by averaging
over the collection of such judgments. This type of data can be represented
by an $N \times N$ matrix $D$, where $N$ is the number of objects, and each element
$d_{ii'}$ records the proximity between the $i$th and $i'$th objects. This matrix is
then provided as input to the clustering algorithm.

Most algorithms presume a matrix of dissimilarities with nonnegative
entries and zero diagonal elements: $d_{ii} = 0$, $i = 1, 2, \ldots, N$. If the original
data were collected as similarities, a suitable monotone-decreasing function
can be used to convert them to dissimilarities. Also, most algorithms assume
symmetric dissimilarity matrices, so if the original matrix $\mathbf{D}$ is not
symmetric it must be replaced by $(\mathbf{D} + \mathbf{D}^T)/2$. Subjectively judged dissimilarities
are seldom distances in the strict sense, since the triangle inequality
$d_{ii'} \le d_{ik} + d_{i'k}$, for all $k \in \{1, \ldots, N\}$ does not hold. Thus, some algorithms
that assume distances cannot be used with such data.
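
As a small illustration of the preprocessing just described, the following sketch symmetrizes a judged dissimilarity matrix and lists the triples that violate the triangle inequality. The matrix values and the function names `symmetrize` and `triangle_violations` are hypothetical and chosen only for this example.

```python
def symmetrize(D):
    """Replace D by (D + D^T) / 2, as required by most clustering algorithms."""
    n = len(D)
    return [[(D[i][j] + D[j][i]) / 2 for j in range(n)] for i in range(n)]


def triangle_violations(D, tol=1e-12):
    """Return all triples (i, i2, k) with d[i][i2] > d[i][k] + d[i2][k]."""
    n = len(D)
    return [
        (i, i2, k)
        for i in range(n)
        for i2 in range(n)
        for k in range(n)
        if D[i][i2] > D[i][k] + D[i2][k] + tol
    ]


# Hypothetical judged dissimilarities: nonnegative, zero diagonal, not symmetric.
judged = [
    [0, 1, 6],
    [3, 0, 2],
    [4, 2, 0],
]

D = symmetrize(judged)
violations = triangle_violations(D)
print(D)
print(violations)
```

Here the symmetrized matrix is a valid dissimilarity matrix but not a distance matrix, so an algorithm that relies on the triangle inequality should not be applied to it.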

14.3.2 Dissimilarities Based on Attributes

Most often we have measurements $x_{ij}$ for $i = 1, 2, \ldots, N$, on variables
$j = 1, 2, \ldots, p$ (also called attributes). Since most of the popular clustering
algorithms take a dissimilarity matrix as their input, we must first construct
pairwise dissimilarities between the observations. In the most common case,
we define a dissimilarity $d_j(x_{ij}, x_{i'j})$ between values of the $j$th attribute,
and then define

$$D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j}) \tag{14.20}$$

as the dissimilarity between objects $i$ and $i'$. By far the most common
choice is squared distance

$$d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2. \tag{14.21}$$


However, other choices are possible, and can lead to potentially different
results. For nonquantitative attributes (e.g., categorical data), squared distance
may not be appropriate. In addition, it is sometimes desirable to
weigh attributes differently rather than giving them equal weight as in
(14.20).

We first discuss alternatives in terms of the attribute type: 

Quantitative variables. Measurements of this type of variable or attribute
are represented by continuous real-valued numbers. It is natural to
define the “error” between them as a monotone-increasing function
of their absolute difference

$$d(x_i, x_{i'}) = l(|x_i - x_{i'}|).$$

Besides squared-error loss $(x_i - x_{i'})^2$, a common choice is the identity
(absolute error). The former places more emphasis on larger differences
than smaller ones. Alternatively, clustering can be based on the
correlation

$$\rho(x_i, x_{i'}) = \frac{\sum_j (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'})}{\sqrt{\sum_j (x_{ij} - \bar{x}_i)^2 \sum_j (x_{i'j} - \bar{x}_{i'})^2}}, \tag{14.22}$$

with $\bar{x}_i = \sum_j x_{ij}/p$. Note that this is averaged over variables, not observations.
If the observations are first standardized, then
$\sum_j (x_{ij} - x_{i'j})^2 \propto 2(1 - \rho(x_i, x_{i'}))$. Hence clustering based on correlation
(similarity) is equivalent to that based on squared distance (dissimilarity).
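
The relation between squared distance and correlation can be checked numerically. The sketch below standardizes two observations across their $p$ variables (mean zero and variance one, using the divisor $p$, which is an assumption of this example) and compares the squared distance with $2p(1 - \rho)$. The two data vectors are made up for illustration.

```python
import math


def standardize(x):
    """Center and scale one observation across its p variables (divisor p)."""
    p = len(x)
    mean = sum(x) / p
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / p)
    return [(v - mean) / sd for v in x]


def correlation(x, y):
    """Correlation (14.22) between two observations, averaged over variables."""
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Two hypothetical observations measured on p = 5 variables.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

lhs = squared_distance(standardize(x), standardize(y))
rhs = 2 * len(x) * (1 - correlation(x, y))
print(lhs, rhs)  # the two numbers agree up to rounding error
```

With this standardization the constant of proportionality is $p$, so ranking pairs by squared distance or by $1 - \rho$ gives the same ordering.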

Ordinal variables. The values of this type of variable are often represented 
as contiguous integers, and the realizable values are considered to be 
an ordered set. Examples are academic grades (A, B, C, D, F), degree 
of preference (can’t stand, dislike, OK, like, terrific). Rank data are a 
special kind of ordinal data. Error measures for ordinal variables are 
generally defined by replacing their $M$ original values with

$$\frac{i - 1/2}{M}, \quad i = 1, \ldots, M \tag{14.23}$$

in the prescribed order of their original values. They are then treated
as quantitative variables on this scale.

Categorical variables. With unordered categorical (also called nominal) 
variables, the degree-of-difference between pairs of values must be 
delineated explicitly. If the variable assumes $M$ distinct values, these
can be arranged in a symmetric $M \times M$ matrix with elements $L_{rr'} = L_{r'r}$,
$L_{rr} = 0$, $L_{rr'} \ge 0$. The most common choice is $L_{rr'} = 1$ for all
$r \ne r'$, while unequal losses can be used to emphasize some errors
more than others.
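
The ordinal and categorical cases above can be sketched in a few lines. The example maps the $M$ ordered levels to the scores $(i - 1/2)/M$ of (14.23) and builds the common 0/1 loss matrix for a nominal variable. The grade and color values, and the function names, are illustrative assumptions.

```python
def ordinal_scores(levels):
    """Map M ordered levels to (i - 1/2) / M for i = 1, ..., M, as in (14.23)."""
    M = len(levels)
    return {level: (i - 0.5) / M for i, level in enumerate(levels, start=1)}


def zero_one_loss_matrix(categories):
    """Symmetric loss with L[r][r] = 0 and L[r][r2] = 1 for r != r2."""
    return {r: {r2: 0 if r == r2 else 1 for r2 in categories} for r in categories}


# Ordinal attribute: academic grades listed in increasing order.
scores = ordinal_scores(["F", "D", "C", "B", "A"])
# Once scored, the grades are treated as quantitative, e.g. with squared error.
d_grade = (scores["A"] - scores["C"]) ** 2

# Categorical attribute: every pair of distinct values is equally different.
L = zero_one_loss_matrix(["red", "green", "blue"])
d_color = L["red"]["blue"]

print(scores, d_grade, d_color)
```

Unequal entries in `L` would express that some category confusions matter more than others.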

14.3.3 Object Dissimilarity

Next we define a procedure for combining the $p$ individual attribute dissimilarities
$d_j(x_{ij}, x_{i'j})$, $j = 1, 2, \ldots, p$ into a single overall measure of dissimilarity
$D(x_i, x_{i'})$ between two objects or observations $(x_i, x_{i'})$ possessing
the respective attribute values. This is nearly always done by means of a
weighted average (convex combination)

$$D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot d_j(x_{ij}, x_{i'j}); \quad \sum_{j=1}^{p} w_j = 1. \tag{14.24}$$

Here $w_j$ is a weight assigned to the $j$th attribute regulating the relative
influence of that variable in determining the overall dissimilarity between
objects. This choice should be based on subject matter considerations.

It is important to realize that setting the weight $w_j$ to the same value
for each variable (say, $w_j = 1 \;\forall j$) does not necessarily give all attributes
equal influence. The influence of the $j$th attribute $X_j$ on object dissimilarity
$D(x_i, x_{i'})$ (14.24) depends upon its relative contribution to the average
object dissimilarity measure over all pairs of observations in the data set

$$\bar{D} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot \bar{d}_j,$$

with

$$\bar{d}_j = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d_j(x_{ij}, x_{i'j}) \tag{14.25}$$

being the average dissimilarity on the $j$th attribute. Thus, the relative influence
of the $j$th variable is $w_j \cdot \bar{d}_j$, and setting $w_j \sim 1/\bar{d}_j$ would give all
attributes equal influence in characterizing overall dissimilarity between objects.
For example, with $p$ quantitative variables and squared-error distance
used for each coordinate, then (14.24) becomes the (weighted) squared Euclidean
distance

$$D_I(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \cdot (x_{ij} - x_{i'j})^2 \tag{14.26}$$

between pairs of points in $\mathbb{R}^p$, with the quantitative variables as axes.
In this case (14.25) becomes

$$\bar{d}_j = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} (x_{ij} - x_{i'j})^2 = 2 \cdot \mathrm{var}_j, \tag{14.27}$$

where $\mathrm{var}_j$ is the sample estimate of $\mathrm{Var}(X_j)$. Thus, the relative importance
of each such variable is proportional to its variance over the data

FIGURE 14.5. Simulated data: on the left, K-means clustering (with K=2) has
been applied to the raw data. The two colors indicate the cluster memberships. On
the right, the features were first standardized before clustering. This is equivalent
to using feature weights $1/[2 \cdot \mathrm{var}(X_j)]$. The standardization has obscured the two
well-separated groups. Note that each plot uses the same units in the horizontal
and vertical axes.


set. In general, setting $w_j = 1/\bar{d}_j$ for all attributes, irrespective of type,
will cause each one of them to equally influence the overall dissimilarity
between pairs of objects $(x_i, x_{i'})$. Although this may seem reasonable, and
is often recommended, it can be highly counterproductive. If the goal is to
segment the data into groups of similar objects, all attributes may not contribute
equally to the (problem-dependent) notion of dissimilarity between
objects. Some attribute value differences may reflect greater actual object
dissimilarity in the context of the problem domain.
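
To make (14.25) and (14.27) concrete, the following sketch computes the average squared-error dissimilarity $\bar{d}_j$ of each attribute, compares it with twice the variance (using the divisor $N$, an assumption of this example), and forms weights proportional to $1/\bar{d}_j$. The small data matrix is hypothetical.

```python
def average_dissimilarity(column):
    """d-bar_j of (14.25) with squared-error dissimilarity on one attribute."""
    n = len(column)
    return sum((a - b) ** 2 for a in column for b in column) / n ** 2


def variance(column):
    """Variance with divisor N (not N - 1)."""
    n = len(column)
    mean = sum(column) / n
    return sum((v - mean) ** 2 for v in column) / n


# Hypothetical data: N = 4 observations on p = 2 attributes with different scales.
X = [[1.0, 10.0], [2.0, 30.0], [4.0, 20.0], [7.0, 60.0]]
columns = list(zip(*X))

d_bar = [average_dissimilarity(c) for c in columns]
twice_var = [2 * variance(c) for c in columns]

# Weights proportional to 1 / d-bar_j, normalized to sum to one as in (14.24).
raw = [1 / d for d in d_bar]
weights = [w / sum(raw) for w in raw]
influence = [w * d for w, d in zip(weights, d_bar)]

print(d_bar, twice_var)
print(weights, influence)  # the influences w_j * d-bar_j are equal
```

As the surrounding discussion stresses, equalizing the influences in this way is a mechanical choice and can obscure attributes that genuinely separate the groups.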

If the goal is to discover natural groupings in the data, some attributes
may exhibit more of a grouping tendency than others. Variables that are
more relevant in separating the groups should be assigned a higher influence
in defining object dissimilarity. Giving all attributes equal influence
in this case will tend to obscure the groups to the point where a clustering
algorithm cannot uncover them. Figure 14.5 shows an example.

Although simple generic prescriptions for choosing the individual attribute
dissimilarities $d_j(x_{ij}, x_{i'j})$ and their weights $w_j$ can be comforting,
there is no substitute for careful thought in the context of each individual
problem. Specifying an appropriate dissimilarity measure is far more
important in obtaining success with clustering than choice of clustering
algorithm. This aspect of the problem is emphasized less in the clustering
literature than the algorithms themselves, since it depends on domain
knowledge specifics and is less amenable to general research.

Finally, often observations have missing values in one or more of the
attributes. The most common method of incorporating missing values in
dissimilarity calculations (14.24) is to omit each observation pair $x_{ij}, x_{i'j}$
having at least one value missing, when computing the dissimilarity between
observations $x_i$ and $x_{i'}$. This method can fail in the circumstance
when both observations have no measured values in common. In this case
both observations could be deleted from the analysis. Alternatively, the
missing values could be imputed using the mean or median of each attribute
over the nonmissing data. For categorical variables, one could consider the
value “missing” as just another categorical value, if it were reasonable to
consider two objects as being similar if they both have missing values on
the same variables.
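
A minimal sketch of the pair-omission rule follows. Missing values are encoded as `None`, squared error is used for each attribute, and the result is rescaled by the total weight of the attributes actually used; that rescaling, like the function name, is an assumption of this example rather than something prescribed above.

```python
def dissimilarity_with_missing(x, y, weights=None):
    """Weighted squared-error dissimilarity (14.24), omitting attribute pairs
    in which at least one value is missing (encoded as None).

    Returns None when the two observations share no measured attribute.
    """
    p = len(x)
    if weights is None:
        weights = [1.0 / p] * p
    total = 0.0
    weight_used = 0.0
    for a, b, w in zip(x, y, weights):
        if a is None or b is None:
            continue  # omit this attribute pair
        total += w * (a - b) ** 2
        weight_used += w
    if weight_used == 0.0:
        return None  # no values in common: the method fails for this pair
    return total / weight_used  # assumption: renormalize over used attributes


print(dissimilarity_with_missing([1.0, None, 3.0], [2.0, 5.0, None]))
print(dissimilarity_with_missing([None, 1.0], [2.0, None]))
```

A caller receiving `None` would then have to fall back on one of the alternatives mentioned above, such as deleting the observations or imputing the missing values.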

14.3.4 Clustering Algorithms

The goal of cluster analysis is to partition the observations into groups
(“clusters”) so that the pairwise dissimilarities between those assigned to
the same cluster tend to be smaller than those in different clusters. Clustering
algorithms fall into three distinct types: combinatorial algorithms,
mixture modeling, and mode seeking.

Combinatorial algorithms work directly on the observed data with no
direct reference to an underlying probability model. Mixture modeling supposes
that the data is an i.i.d. sample from some population described by a
probability density function. This density function is characterized by a parameterized
model taken to be a mixture of component density functions;
each component density describes one of the clusters. This model is then fit
to the data by maximum likelihood or corresponding Bayesian approaches.
Mode seekers (“bump hunters”) take a nonparametric perspective, attempting
to directly estimate distinct modes of the probability density function.
Observations “closest” to each respective mode then define the individual
clusters.

Mixture modeling is described in Section 6.8. The PRIM algorithm, discussed
in Sections 9.3 and 14.2.5, is an example of mode seeking or “bump
hunting.” We discuss combinatorial algorithms next.

14.3.5 Combinatorial Algorithms

The most popular clustering algorithms directly assign each observation
to a group or cluster without regard to a probability model describing the
data. Each observation is uniquely labeled by an integer $i \in \{1, \ldots, N\}$.
A prespecified number of clusters $K < N$ is postulated, and each one is
labeled by an integer $k \in \{1, \ldots, K\}$. Each observation is assigned to one
and only one cluster. These assignments can be characterized by a many-to-one
mapping, or encoder $k = C(i)$, that assigns the $i$th observation to
the $k$th cluster. One seeks the particular encoder $C^*(i)$ that achieves the
required goal (details below), based on the dissimilarities $d(x_i, x_{i'})$ between
every pair of observations. These are specified by the user as described
above. Generally, the encoder $C(i)$ is explicitly delineated by giving its
value (cluster assignment) for each observation $i$. Thus, the “parameters”
of the procedure are the individual cluster assignments for each of the $N$
observations. These are adjusted so as to minimize a “loss” function that
characterizes the degree to which the clustering goal is not met.

One approach is to directly specify a mathematical loss function and 
attempt to minimize it through some combinatorial optimization algorithm. 
Since the goal is to assign close points to the same cluster, a natural loss 
(or “energy”) function would be 

$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k} d(x_i, x_{i'}). \tag{14.28}$$

This criterion characterizes the extent to which observations assigned to
the same cluster tend to be close to one another. It is sometimes referred
to as the “within cluster” point scatter since

$$T = \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d_{ii'} = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\left(\sum_{C(i')=k} d_{ii'} + \sum_{C(i')\neq k} d_{ii'}\right),$$

or

$$T = W(C) + B(C),$$

where $d_{ii'} = d(x_i, x_{i'})$. Here $T$ is the total point scatter, which is a constant
given the data, independent of cluster assignment. The quantity

$$B(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')\neq k} d_{ii'} \tag{14.29}$$

is the between-cluster point scatter. This will tend to be large when observations
assigned to different clusters are far apart. Thus one has

$$W(C) = T - B(C)$$

and minimizing $W(C)$ is equivalent to maximizing $B(C)$.
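
The decomposition $T = W(C) + B(C)$ is easy to verify numerically. The sketch below computes the three scatters from a dissimilarity matrix and an encoder given as a list of cluster labels. The five one-dimensional points and the two encoders are hypothetical.

```python
def scatter(D, C):
    """Return (T, W, B) of (14.28)-(14.29) for dissimilarities D and encoder C."""
    n = len(D)
    pairs = [(i, i2) for i in range(n) for i2 in range(n)]
    within = sum(D[i][i2] for i, i2 in pairs if C[i] == C[i2]) / 2
    between = sum(D[i][i2] for i, i2 in pairs if C[i] != C[i2]) / 2
    total = sum(D[i][i2] for i, i2 in pairs) / 2
    return total, within, between


# Hypothetical one-dimensional data with two obvious groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
D = [[(a - b) ** 2 for b in points] for a in points]

T_good, W_good, B_good = scatter(D, [0, 0, 0, 1, 1])  # natural grouping
T_bad, W_bad, B_bad = scatter(D, [0, 1, 0, 1, 0])     # arbitrary grouping

print(T_good, W_good, B_good)
print(T_bad, W_bad, B_bad)
```

The total scatter is the same for both encoders, while the natural grouping has the smaller within-cluster scatter and hence the larger between-cluster scatter.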

Cluster analysis by combinatorial optimization is straightforward in principle.
One simply minimizes $W$ or equivalently maximizes $B$ over all possible
assignments of the $N$ data points to $K$ clusters. Unfortunately, such
optimization by complete enumeration is feasible only for very small data
sets. The number of distinct assignments is (Jain and Dubes, 1988)

$$S(N, K) = \frac{1}{K!}\sum_{k=1}^{K} (-1)^{K-k}\binom{K}{k} k^N. \tag{14.30}$$

For example, $S(10, 4) = 34{,}105$ which is quite feasible. But, $S(N, K)$ grows
very rapidly with increasing values of its arguments. Already $S(19, 4) \simeq
10^{10}$, and most clustering problems involve much larger data sets than
$N = 19$. For this reason, practical clustering algorithms are able to examine
only a very small fraction of all possible encoders $k = C(i)$. The goal is to
identify a small subset that is likely to contain the optimal one, or at least
a good suboptimal partition.
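
Formula (14.30) can be evaluated exactly with integer arithmetic, which gives a feel for how quickly the count grows. The function name `num_assignments` is ours; `math.comb` requires Python 3.8 or later.

```python
import math


def num_assignments(N, K):
    """Number of distinct assignments of N points to K clusters, as in (14.30)."""
    total = sum(
        (-1) ** (K - k) * math.comb(K, k) * k ** N
        for k in range(1, K + 1)
    )
    return total // math.factorial(K)  # the sum is always divisible by K!


print(num_assignments(10, 4))  # 34105, the value quoted in the text
print(num_assignments(19, 4))  # of the order of 10**10
print(num_assignments(100, 4))
```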

Such feasible strategies are based on iterative greedy descent. An initial
partition is specified. At each iterative step, the cluster assignments are
changed in such a way that the value of the criterion is improved from
its previous value. Clustering algorithms of this type differ in their prescriptions
for modifying the cluster assignments at each iteration. When
the prescription is unable to provide an improvement, the algorithm terminates
with the current assignments as its solution. Since the assignment
of observations to clusters at any iteration is a perturbation of that for the
previous iteration, only a very small fraction of all possible assignments
(14.30) are examined. However, these algorithms converge to local optima
which may be highly suboptimal when compared to the global optimum.


14.3.6 K-means

The K-means algorithm is one of the most popular iterative descent clustering
methods. It is intended for situations in which all variables are of
the quantitative type, and squared Euclidean distance

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2$$

is chosen as the dissimilarity measure. Note that weighted Euclidean distance
can be used by redefining the $x_{ij}$ values (Exercise 14.1).

The within-point scatter (14.28) can be written as

$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k}\|x_i - x_{i'}\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k}\|x_i - \bar{x}_k\|^2, \tag{14.31}$$

where $\bar{x}_k = (\bar{x}_{1k}, \ldots, \bar{x}_{pk})$ is the mean vector associated with the $k$th cluster,
and $N_k = \sum_{i=1}^{N} I(C(i) = k)$. Thus, the criterion is minimized by
assigning the $N$ observations to the $K$ clusters in such a way that within
each cluster the average dissimilarity of the observations from the cluster
mean, as defined by the points in that cluster, is minimized.

An iterative descent algorithm for solving

$$C^* = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k}\|x_i - \bar{x}_k\|^2$$

can be obtained by noting that for any set of observations $S$

$$\bar{x}_S = \operatorname*{argmin}_{m} \sum_{i \in S}\|x_i - m\|^2. \tag{14.32}$$

Hence we can obtain $C^*$ by solving the enlarged optimization problem

$$\min_{C, \{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i)=k}\|x_i - m_k\|^2. \tag{14.33}$$

This can be minimized by an alternating optimization procedure given in
Algorithm 14.1.

Algorithm 14.1 K-means Clustering.

1. For a given cluster assignment $C$, the total cluster variance (14.33) is
minimized with respect to $\{m_1, \ldots, m_K\}$ yielding the means of the
currently assigned clusters (14.32).

2. Given a current set of means $\{m_1, \ldots, m_K\}$, (14.33) is minimized by
assigning each observation to the closest (current) cluster mean. That
is,

$$C(i) = \operatorname*{argmin}_{1 \le k \le K} \|x_i - m_k\|^2. \tag{14.34}$$

3. Steps 1 and 2 are iterated until the assignments do not change.

Each of steps 1 and 2 reduces the value of the criterion (14.33), so that
convergence is assured. However, the result may represent a suboptimal
local minimum. The algorithm of Hartigan and Wong (1979) goes further,
and ensures that there is no single switch of an observation from one group
to another group that will decrease the objective. In addition, one should
start the algorithm with many different random choices for the starting
means, and choose the solution having smallest value of the objective function.
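
The alternating procedure and the advice about random restarts can be sketched as follows. This is a minimal illustration, not the Hartigan and Wong algorithm. Starting means are drawn at random from the data, an empty cluster keeps its previous mean, and restarts are compared by the plain within-cluster sum of squares (without the $N_k$ factor); these three choices, the function names and the toy data are all assumptions of the sketch.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vector(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(X, K, rng, max_iter=100):
    """One run of the alternating procedure from random starting means."""
    means = [list(x) for x in rng.sample(X, K)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest current mean.
        new_assignment = [
            min(range(K), key=lambda k: squared_distance(x, means[k])) for x in X
        ]
        if new_assignment == assignment:
            break  # Step 3: assignments did not change
        assignment = new_assignment
        # Step 1: recompute each mean from its currently assigned points.
        for k in range(K):
            members = [x for x, c in zip(X, assignment) if c == k]
            if members:  # an empty cluster keeps its previous mean
                means[k] = mean_vector(members)
    objective = sum(squared_distance(x, means[c]) for x, c in zip(X, assignment))
    return assignment, means, objective


def kmeans_restarts(X, K, n_starts=10, seed=0):
    """Keep the run with the smallest objective over several random starts."""
    rng = random.Random(seed)
    runs = [kmeans(X, K, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])


# Hypothetical data: two well-separated groups in the plane.
X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [5.0, 5.0], [5.3, 4.8], [4.9, 5.4]]
assignment, means, objective = kmeans_restarts(X, K=2)
print(assignment, means, objective)
```

The cluster labels themselves are arbitrary, so only the grouping of the observations is meaningful when comparing runs.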

Figure 14.6 shows some of the K-means iterations for the simulated data
of Figure 14.4. The centroids are depicted by “O”s. The straight lines show
the partitioning of points, each sector being the set of points closest to
each centroid. This partitioning is called the Voronoi tessellation. After 20
iterations the procedure has converged.


14.3.7 Gaussian Mixtures as Soft K-means Clustering

The K-means clustering procedure is closely related to the EM algorithm
for estimating a certain Gaussian mixture model (Sections 6.8 and 8.5.1).

FIGURE 14.6. Successive iterations of the K-means clustering algorithm for
the simulated data of Figure 14.4.

FIGURE 14.7. (Left panels:) two Gaussian densities $g_0(x)$ and $g_1(x)$ (blue and
orange) on the real line, and a single data point (green dot) at $x = 0.5$. The colored
squares are plotted at $x = -1.0$ and $x = 1.0$, the means of each density. (Right
panels:) the relative densities $g_0(x)/(g_0(x) + g_1(x))$ and $g_1(x)/(g_0(x) + g_1(x))$,
called the “responsibilities” of each cluster, for this data point. In the top panels,
the Gaussian standard deviation $\sigma = 1.0$; in the bottom panels $\sigma = 0.2$. The
EM algorithm uses these responsibilities to make a “soft” assignment of each
data point to each of the two clusters. When $\sigma$ is fairly large, the responsibilities
can be near 0.5 (they are 0.36 and 0.64 in the top right panel). As $\sigma \to 0$, the
responsibilities $\to 1$, for the cluster center closest to the target point, and 0 for
all other clusters. This “hard” assignment is seen in the bottom right panel.

The E-step of the EM algorithm assigns “responsibilities” for each data
point based on its relative density under each mixture component, while
the M-step recomputes the component density parameters based on the
current responsibilities. Suppose we specify $K$ mixture components, each
with a Gaussian density having scalar covariance matrix $\sigma^2 \mathbf{I}$. Then the
relative density under each mixture component is a monotone function of
the Euclidean distance between the data point and the mixture center.
Hence in this setup EM is a “soft” version of K-means clustering, making
probabilistic (rather than deterministic) assignments of points to cluster
centers. As the variance $\sigma^2 \to 0$, these probabilities become 0 and 1, and
the two methods coincide. Details are given in Exercise 14.2. Figure 14.7
illustrates this result for two clusters on the real line.
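
The passage from soft to hard assignments can be seen in a short computation. The sketch below evaluates the responsibilities of a one-dimensional point under Gaussian components with a common standard deviation, assuming equal mixing proportions. The point, the centers and the two values of sigma are hypothetical and are not those of Figure 14.7.

```python
import math


def responsibilities(x, centers, sigma):
    """Relative densities of x under Gaussians N(center, sigma^2).

    Assumes equal mixing proportions; the shared normalizing constant cancels.
    """
    log_density = [-((x - m) ** 2) / (2 * sigma ** 2) for m in centers]
    top = max(log_density)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_density]
    total = sum(weights)
    return [w / total for w in weights]


centers = [-1.0, 1.0]
soft = responsibilities(0.2, centers, sigma=1.0)
hard = responsibilities(0.2, centers, sigma=0.05)
print(soft)  # both components receive appreciable weight
print(hard)  # essentially all weight on the closest center, as in K-means
```

Because the responsibility is a monotone function of the distance to each center, the component with the largest responsibility is always the closest center, whatever the value of sigma.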


14.3.8 Example: Human Tumor Microarray Data

We apply K-means clustering to the human tumor microarray data described
in Chapter 1. This is an example of high-dimensional clustering.

FIGURE 14.8. Total within-cluster sum of squares for K-means clustering applied
to the human tumor microarray data.


TABLE 14.2. Human tumor data: number of cancer cases of each type, in each
of the three clusters from K-means clustering.

Cluster   Breast   CNS   Colon   K562   Leukemia   MCF7
1         3        5     0       0      0          0
2         2        0     0       2      6          2
3         2        0     7       0      0          0

Cluster   Melanoma   NSCLC   Ovarian   Prostate   Renal   Unknown
1         1          7       6         2          9       1
2         7          2       0         0          0       0
3         0          0       0         0          0       0


The data are a $6830 \times 64$ matrix of real numbers, each representing an
expression measurement for a gene (row) and sample (column). Here we
cluster the samples, each of which is a vector of length 6830, corresponding
to expression values for the 6830 genes. Each sample has a label such
as breast (for breast cancer), melanoma, and so on; we don’t use these labels
in the clustering, but will examine post hoc which labels fall into which
clusters.

We applied K-means clustering with $K$ running from 1 to 10, and computed
the total within-sum of squares for each clustering, shown in Figure 14.8.
Typically one looks for a kink in the sum of squares curve (or its
logarithm) to locate the optimal number of clusters (see Section 14.3.11).
Here there is no clear indication: for illustration we chose $K = 3$ giving the
three clusters shown in Table 14.2.

FIGURE 14.9. Sir Ronald A. Fisher (1890-1962) was one of the founders
of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and
many other fundamental concepts. The image on the left is a $1024 \times 1024$ grayscale
image at 8 bits per pixel. The center image is the result of $2 \times 2$ block VQ, using
200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses
only four code vectors, with a compression rate of 0.50 bits/pixel.


We see that the procedure is successful at grouping together samples of
the same cancer. In fact, the two breast cancers in the second cluster were
later found to be misdiagnosed and were melanomas that had metastasized.
However, K-means clustering has shortcomings in this application. For one,
it does not give a linear ordering of objects within a cluster: we have simply
listed them in alphabetic order above. Secondly, as the number of clusters
$K$ is changed, the cluster memberships can change in arbitrary ways. That
is, with say four clusters, the clusters need not be nested within the three
clusters above. For these reasons, hierarchical clustering (described later),
is probably preferable for this application.


14.3.9 Vector Quantization

The K-means clustering algorithm represents a key tool in the apparently
unrelated area of image and signal compression, particularly in vector quantization
or VQ (Gersho and Gray, 1992). The left image in Figure 14.9² is a
digitized photograph of a famous statistician, Sir Ronald Fisher. It consists
of $1024 \times 1024$ pixels, where each pixel is a grayscale value ranging from 0
to 255, and hence requires 8 bits of storage per pixel. The entire image occupies
1 megabyte of storage. The center image is a VQ-compressed version
of the left panel, and requires 0.239 of the storage (at some loss in quality).
The right image is compressed even more, and requires only 0.0625 of the
storage (at a considerable loss in quality).

The version of VQ implemented here first breaks the image into small
blocks, in this case $2 \times 2$ blocks of pixels. Each of the $512 \times 512$ blocks of four
numbers is regarded as a vector in $\mathbb{R}^4$. A K-means clustering algorithm
(also known as Lloyd’s algorithm in this context) is run in this space.
The center image uses $K = 200$, while the right image $K = 4$. Each of
the $512 \times 512$ pixel blocks (or points) is approximated by its closest cluster
centroid, known as a codeword. The clustering process is called the encoding
step, and the collection of centroids is called the codebook.

² This example was prepared by Maya Gupta.

To represent the approximated image, we need to supply for each block
the identity of the codebook entry that approximates it. This will require
$\log_2(K)$ bits per block. We also need to supply the codebook itself, which
is $K \times 4$ real numbers (typically negligible). Overall, the storage for the
compressed image amounts to $\log_2(K)/(4 \cdot 8)$ of the original (0.239 for
$K = 200$, 0.063 for $K = 4$). This is typically expressed as a rate in bits
per pixel: $\log_2(K)/4$, which are 1.91 and 0.50, respectively. The process
of constructing the approximate image from the centroids is called the
decoding step.
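
The storage figures quoted above follow directly from the fixed-length code. The sketch below reproduces them, ignoring the codebook itself as the text does. The function name `vq_storage` and its default arguments are ours.

```python
import math


def vq_storage(K, block_pixels=4, bits_per_pixel=8):
    """Rate in bits per pixel and fraction of the original storage for a
    fixed-length code over K codewords (codebook storage ignored)."""
    bits_per_block = math.log2(K)
    rate = bits_per_block / block_pixels
    fraction = bits_per_block / (block_pixels * bits_per_pixel)
    return rate, fraction


for K in (200, 4):
    rate, fraction = vq_storage(K)
    print(K, round(rate, 2), round(fraction, 4))
```

For $K = 200$ this gives a rate of about 1.91 bits per pixel and a storage fraction of about 0.239; for $K = 4$ it gives 0.50 and 0.0625.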

Why do we expect VQ to work at all? The reason is that for typical 
everyday images like photographs, many of the blocks look the same. In 
this case there are many almost pure white blocks, and similarly pure gray 
blocks of various shades. These require only one block each to represent 
them, and then multiple pointers to that block. 

What we have described is known as lossy compression, since our images
are degraded versions of the original. The degradation or distortion is
usually measured in terms of mean squared error. In this case $D = 0.89$
for $K = 200$ and $D = 16.95$ for $K = 4$. More generally a rate/distortion
curve would be used to assess the tradeoff. One can also perform lossless
compression using block clustering, and still capitalize on the repeated patterns.
If you took the original image and losslessly compressed it, the best
you would do is 4.48 bits per pixel.

We claimed above that $\log_2(K)$ bits were needed to identify each of the $K$
codewords in the codebook. This uses a fixed-length code, and is inefficient
if some codewords occur many more times than others in the image. Using
Shannon coding theory, we know that in general a variable length code
will do better, and the rate then becomes $-\sum_{\ell=1}^{K} p_\ell \log_2(p_\ell)/4$. The term
in the numerator is the entropy of the distribution $p_\ell$ of the codewords
in the image. Using variable length coding our rates come down to 1.42
and 0.39, respectively. Finally, there are many generalizations of VQ that
have been developed: for example, tree-structured VQ finds the centroids
with a top-down, 2-means style algorithm, as alluded to in Section 14.3.12.
This allows successive refinement of the compression. Further details may
be found in Gersho and Gray (1992).
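
The entropy-based rate can be computed from the empirical frequencies of the codewords assigned to the blocks. The sketch below does this for two hypothetical sequences of codeword indices; the actual codeword frequencies for the Fisher image are not given in the text, so the numbers here are purely illustrative.

```python
import math
from collections import Counter


def entropy_rate(codeword_ids, block_pixels=4):
    """Bits per pixel under an ideal variable-length code: the entropy of the
    empirical codeword distribution divided by the pixels per block."""
    counts = Counter(codeword_ids)
    n = len(codeword_ids)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / block_pixels


# Hypothetical assignments of 100 blocks to K = 4 codewords.
uniform = [0, 1, 2, 3] * 25
skewed = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10

print(entropy_rate(uniform))  # equals the fixed-length rate log2(4) / 4
print(entropy_rate(skewed))   # smaller, since one codeword dominates
```

When all codewords are equally frequent nothing is gained over the fixed-length code; the saving comes entirely from unequal frequencies.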

14.3.10 K-medoids

As discussed above, the K-means algorithm is appropriate when the dissimilarity
measure is taken to be squared Euclidean distance $D(x_i, x_{i'})$
(14.112). This requires all of the variables to be of the quantitative type. In
addition, using squared Euclidean distance places the highest influence on
the largest distances. This causes the procedure to lack robustness against
outliers that produce very large distances. These restrictions can be removed
at the expense of computation.

Algorithm 14.2 K-medoids Clustering.

1. For a given cluster assignment $C$ find the observation in the cluster
minimizing total distance to other points in that cluster:

$$i^*_k = \operatorname*{argmin}_{\{i : C(i) = k\}} \sum_{C(i') = k} D(x_i, x_{i'}). \tag{14.35}$$

Then $m_k = x_{i^*_k}$, $k = 1, 2, \ldots, K$ are the current estimates of the
cluster centers.

2. Given a current set of cluster centers $\{m_1, \ldots, m_K\}$, minimize the total
error by assigning each observation to the closest (current) cluster
center:

$$C(i) = \operatorname*{argmin}_{1 \le k \le K} D(x_i, m_k). \tag{14.36}$$

3. Iterate steps 1 and 2 until the assignments do not change.

The only part of the K-means algorithm that assumes squared Euclidean
distance is the minimization step (14.32); the cluster representatives
$\{m_1, \ldots, m_K\}$ in (14.33) are taken to be the means of the currently assigned
clusters. The algorithm can be generalized for use with arbitrarily defined
dissimilarities $D(x_i, x_{i'})$ by replacing this step by an explicit optimization
with respect to $\{m_1, \ldots, m_K\}$ in (14.33). In the most common form, centers
for each cluster are restricted to be one of the observations assigned
to the cluster, as summarized in Algorithm 14.2. This algorithm assumes
attribute data, but the approach can also be applied to data described
only by proximity matrices (Section 14.3.1). There is no need to explicitly
compute cluster centers; rather we just keep track of the indices $i^*_k$.

Solving (14.32) for each provisional cluster $k$ requires an amount of computation
proportional to the number of observations assigned to it, whereas
for solving (14.35) the computation increases to $O(N_k^2)$. Given a set of cluster
“centers,” $\{i_1, \ldots, i_K\}$, obtaining the new assignments

$$C(i) = \operatorname*{argmin}_{1 \le k \le K} d_{i i^*_k} \tag{14.37}$$

requires computation proportional to $K \cdot N$ as before. Thus, K-medoids is
far more computationally intensive than K-means.
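
Because K-medoids needs only the dissimilarities, it can be written directly in terms of a proximity matrix and the indices of the current medoids. The following sketch alternates the assignment step (14.37) and the medoid update (14.35). The toy matrix, the starting medoids and the handling of ties (the first minimizer is kept) are assumptions of the example.

```python
def kmedoids(D, medoids, max_iter=100):
    """Alternate assignment (14.37) and medoid update (14.35) on a
    dissimilarity matrix D, starting from the given medoid indices."""
    n = len(D)
    medoids = list(medoids)
    K = len(medoids)
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to its closest current medoid.
        new_assignment = [
            min(range(K), key=lambda k: D[i][medoids[k]]) for i in range(n)
        ]
        if new_assignment == assignment:
            break  # assignments did not change
        assignment = new_assignment
        # Within each cluster, pick the member with the smallest total
        # dissimilarity to the other members; this costs O(N_k^2).
        for k in range(K):
            members = [i for i in range(n) if assignment[i] == k]
            if members:
                medoids[k] = min(
                    members, key=lambda i: sum(D[i][i2] for i2 in members)
                )
    return assignment, medoids


# Hypothetical dissimilarities among six objects forming two groups.
D = [
    [0, 1, 2, 8, 9, 9],
    [1, 0, 1, 8, 8, 9],
    [2, 1, 0, 7, 8, 8],
    [8, 8, 7, 0, 1, 2],
    [9, 8, 8, 1, 0, 1],
    [9, 9, 8, 2, 1, 0],
]

# A deliberately poor start: both initial medoids lie in the same group.
assignment, medoids = kmedoids(D, medoids=[0, 1])
print(assignment, medoids)
```

On this small example the procedure recovers the two groups despite the poor start, but like K-means it is only guaranteed to reach a local optimum.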

Alternating between (14.35) and (14.37) represents a particular heuristic
search strategy for trying to solve

$$\min_{C, \{i_k\}_1^K} \sum_{k=1}^{K}\sum_{C(i)=k} d_{i i_k}. \tag{14.38}$$

TABLE 14.3. Data from a political science survey: values are average pairwise
dissimilarities of countries from a questionnaire given to political science students.

      BEL    BRA    CHI    CUB    EGY    FRA    IND    ISR    USA    USS    YUG
BRA   5.58
CHI   7.00   6.50
CUB   7.08   7.00   3.83
EGY   4.83   5.08   8.17   5.83
FRA   2.17   5.75   6.67   6.92   4.92
IND   6.42   5.00   5.58   6.00   4.67   6.42
ISR   3.42   5.50   6.42   6.42   5.00   3.92   6.17
USA   2.50   4.92   6.25   7.33   4.50   2.25   6.33   2.75
USS   6.08   6.67   4.25   2.67   6.00   6.17   6.17   6.92   6.17
YUG   5.25   6.83   4.50   3.75   5.75   5.42   6.08   5.83   6.67   3.67
ZAI   4.75   3.00   6.08   6.67   5.00   5.58   4.83   6.17   5.67   6.50   6.92


Kaufman and Rousseeuw (1990) propose an alternative strategy for directly
solving (14.38) that provisionally exchanges each center with an observation
that is not currently a center, selecting the exchange that produces
the greatest reduction in the value of the criterion (14.38). This is repeated
until no advantageous exchanges can be found. Massart et al. (1983) derive
a branch-and-bound combinatorial method that finds the global minimum
of (14.38), but it is practical only for very small data sets.

Example: Country Dissimilarities 

This example, taken from Kaufman and Rousseeuw (1990), comes from a
study in which political science students were asked to provide pairwise dissimilarity
measures for 12 countries: Belgium, Brazil, Chile, Cuba, Egypt,
France, India, Israel, United States, Union of Soviet Socialist Republics,
Yugoslavia and Zaire. The average dissimilarity scores are given in Table 14.3.
We applied 3-medoid clustering to these dissimilarities. Note that
K-means clustering could not be applied because we have only distances
rather than raw observations. The left panel of Figure 14.10 shows the
dissimilarities reordered and blocked according to the 3-medoid clustering.
The right panel is a two-dimensional multidimensional scaling plot, with
the 3-medoid cluster assignments indicated by colors (multidimensional
scaling is discussed in Section 14.8). Both plots show three well-separated
clusters, but the MDS display indicates that “Egypt” falls about halfway
between two clusters.

FIGURE 14.10. Survey of country dissimilarities. (Left panel:) dissimilarities
reordered and blocked according to 3-medoid clustering. Heat map is coded from
most similar (dark red) to least similar (bright red). (Right panel:) two-dimensional
multidimensional scaling plot, with 3-medoid clusters indicated by different
colors.

14.3.11 Practical Issues

In order to apply K-means or K-medoids one must select the number of
clusters $K^*$ and an initialization. The latter can be defined by specifying
an initial set of centers $\{m_1, \ldots, m_K\}$ or $\{i_1, \ldots, i_K\}$ or an initial encoder
$C(i)$. Usually specifying the centers is more convenient. Suggestions range
from simple random selection to a deliberate strategy based on forward
stepwise assignment. At each step a new center $i_k$ is chosen to minimize
the criterion (14.33) or (14.38), given the centers $i_1, \ldots, i_{k-1}$ chosen at the
previous steps. This continues for $K$ steps, thereby producing $K$ initial
centers with which to begin the optimization algorithm.

A choice for the number of clusters $K$ depends on the goal. For data
segmentation $K$ is usually defined as part of the problem. For example,
a company may employ $K$ sales people, and the goal is to partition a
customer database into $K$ segments, one for each sales person, such that the
customers assigned to each one are as similar as possible. Often, however,
cluster analysis is used to provide a descriptive statistic for ascertaining the
extent to which the observations comprising the data base fall into natural
distinct groupings. Here the number of such groups $K^*$ is unknown and
one requires that it, as well as the groupings themselves, be estimated from
the data.

Data-based methods for estimating $K^*$ typically examine the within-cluster
dissimilarity $W_K$ as a function of the number of clusters $K$. Separate
solutions are obtained for $K \in \{1, 2, \ldots, K_{\max}\}$. The
corresponding values $\{W_1, W_2, \ldots, W_{K_{\max}}\}$ generally decrease
with increasing $K$. This will be
the case even when the criterion is evaluated on an independent test set, 
since a large number of cluster centers will tend to fill the feature space 
densely and thus will be close to all data points. Thus cross-validation 
techniques, so useful for model selection in supervised learning, cannot be 
utilized in this context. 

The intuition underlying the approach is that if there are actually $K^*$
distinct groupings of the observations (as defined by the dissimilarity
measure), then for $K < K^*$ the clusters returned by the algorithm will each
contain a subset of the true underlying groups. That is, the solution will
not assign observations in the same naturally occurring group to different
estimated clusters. To the extent that this is the case, the solution criterion
value will tend to decrease substantially with each successive increase in the
number of specified clusters, $W_{K+1} \ll W_K$, as the natural groups are
successively assigned to separate clusters. For $K > K^*$, one of the estimated
clusters must partition at least one of the natural groups into two
subgroups. This will tend to provide a smaller decrease in the criterion as $K$
is further increased. Splitting a natural group, within which the observations
are all quite close to each other, reduces the criterion less than partitioning
the union of two well-separated groups into their proper constituents.

To the extent this scenario is realized, there will be a sharp decrease in
successive differences in criterion value, $W_K - W_{K+1}$, at $K = K^*$. That
is, $\{W_K - W_{K+1} \mid K < K^*\} \gg \{W_K - W_{K+1} \mid K \ge K^*\}$. An
estimate $\hat{K}^*$ for $K^*$ is then obtained by identifying a “kink” in the
plot of $W_K$ as a function of $K$. As with other aspects of clustering
procedures, this approach is somewhat heuristic.

The recently proposed Gap statistic (Tibshirani et al., 2001b) compares
the curve $\log W_K$ to the curve obtained from data uniformly distributed
over a rectangle containing the data. It estimates the optimal number of
clusters to be the place where the gap between the two curves is largest.
Essentially this is an automatic way of locating the aforementioned “kink.”
It also works reasonably well when the data fall into a single cluster, and
in that case will tend to estimate the optimal number of clusters to be one.
This is the scenario where most other competing methods fail.

Figure 14.11 shows the result of the Gap statistic applied to simulated
data of Figure 14.4. The left panel shows $\log W_K$ for $K = 1, 2, \ldots, 8$
clusters (green curve) and the expected value of $\log W_K$ over 20 simulations
from uniform data (blue curve). The right panel shows the gap curve, which is
the expected curve minus the observed curve. Shown also are error bars of
half-width $s'_K = s_K \sqrt{1 + 1/20}$, where $s_K$ is the standard deviation
of $\log W_K$ over the 20 simulations. The Gap curve is maximized at $K = 2$
clusters. If $G(K)$ is the Gap curve at $K$ clusters, the formal rule for
estimating $K^*$ is

$$ K^* = \operatorname*{argmin}_K \{K \mid G(K) \ge G(K+1) - s'_{K+1}\}. \qquad (14.39) $$
FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of
$\log W_K$ for the simulated data of Figure 14.4, plotted against the number of
clusters. Both curves have been translated to equal zero at one cluster.
(Right panel): Gap curve, equal to the difference between the observed and
expected values of $\log W_K$, plotted against the number of clusters. The Gap
estimate $K^*$ is the smallest $K$ producing a gap within one standard
deviation of the gap at $K + 1$; here $K^* = 2$.

This gives K* = 2, which looks reasonable from Figure 14.4. 
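
Rule (14.39) is simple to apply once the observed and reference values of
$\log W_K$ are available. The sketch below assumes those values have already
been computed by some clustering routine; the function name `gap_estimate`
and the input layout are hypothetical, and the standard deviation is taken
with divisor $B$ (the number of reference data sets), which is one possible
convention.

```python
import math


def gap_estimate(log_w_obs, log_w_ref):
    """Apply the Gap rule to precomputed log W_K values.

    log_w_obs[k - 1] : observed log W_K for K = k
    log_w_ref[k - 1] : list of log W_K values, one per reference data set
    Returns (estimated K, gap curve, error-bar half-widths).
    """
    gaps, s_prime = [], []
    for obs, ref in zip(log_w_obs, log_w_ref):
        B = len(ref)
        mean = sum(ref) / B
        # standard deviation over the B reference data sets (divisor B)
        sd = math.sqrt(sum((r - mean) ** 2 for r in ref) / B)
        gaps.append(mean - obs)                    # expected minus observed
        s_prime.append(sd * math.sqrt(1 + 1 / B))  # s'_K = s_K sqrt(1 + 1/B)
    # smallest K with G(K) >= G(K + 1) - s'_{K + 1}
    for k in range(len(gaps) - 1):
        if gaps[k] >= gaps[k + 1] - s_prime[k + 1]:
            return k + 1, gaps, s_prime
    return len(gaps), gaps, s_prime


# Made-up values for K = 1, ..., 4 with B = 2 reference data sets each.
log_w_obs = [0.0, -1.5, -1.8, -2.0]
log_w_ref = [[0.1, -0.1], [-0.5, -0.7], [-0.9, -1.1], [-1.2, -1.4]]
k_hat, gaps, s_prime = gap_estimate(log_w_obs, log_w_ref)
print(k_hat)
```

With $B = 20$ the factor $\sqrt{1 + 1/B}$ reproduces the half-width used in
Figure 14.11.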

14.3.12 Hierarchical Clustering

The results of applying $K$-means or $K$-medoids clustering algorithms
depend on the choice for the number of clusters to be searched and a starting
configuration assignment. In contrast, hierarchical clustering methods do
not require such specifications. Instead, they require the user to specify a
measure of dissimilarity between (disjoint) groups of observations, based
on the pairwise dissimilarities among the observations in the two groups.
As the name suggests, they produce hierarchical representations in which
the clusters at each level of the hierarchy are created by merging clusters
at the next lower level. At the lowest level, each cluster contains a single
observation. At the highest level there is only one cluster containing all of
the data.

Strategies for hierarchical clustering divide into two basic paradigms:
agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies
start at the bottom and at each level recursively merge a selected pair of
clusters into a single cluster. This produces a grouping at the next higher
level with one less cluster. The pair chosen for merging consist of the two
groups with the smallest intergroup dissimilarity. Divisive methods start
at the top and at each level recursively split one of the existing clusters at
that level into two new clusters. The split is chosen to produce two new
groups with the largest between-group dissimilarity. With both paradigms
there are $N - 1$ levels in the hierarchy.

Each level of the hierarchy represents a particular grouping of the data 
into disjoint clusters of observations. The entire hierarchy represents an 
ordered sequence of such groupings. It is up to the user to decide which 
level (if any) actually represents a “natural” clustering in the sense that 
observations within each of its groups are sufficiently more similar to each 
other than to observations assigned to different groups at that level. The 
Gap statistic described earlier can be used for this purpose. 

Recursive binary splitting/agglomeration can be represented by a rooted
binary tree. The nodes of the trees represent groups. The root node
represents the entire data set. The $N$ terminal nodes each represent one of
the individual observations (singleton clusters). Each nonterminal node
(“parent”) has two daughter nodes. For divisive clustering the two daughters
represent the two groups resulting from the split of the parent; for
agglomerative clustering the daughters represent the two groups that were
merged to form the parent.

All agglomerative and some divisive methods (when viewed bottom-up) 
possess a monotonicity property. That is, the dissimilarity between merged 
clusters is monotone increasing with the level of the merger. Thus the 
binary tree can be plotted so that the height of each node is proportional 
to the value of the intergroup dissimilarity between its two daughters. The 
terminal nodes representing individual observations are all plotted at zero 
height. This type of graphical display is called a dendrogram. 

A dendrogram provides a highly interpretable complete description of 
the hierarchical clustering in a graphical format. This is one of the main 
reasons for the popularity of hierarchical clustering methods. 

For the microarray data, Figure 14.12 shows the dendrogram resulting
from agglomerative clustering with average linkage; agglomerative clustering
and this example are discussed in more detail later in this chapter.
Cutting the dendrogram horizontally at a particular height partitions the
data into disjoint clusters represented by the vertical lines that intersect
it. These are the clusters that would be produced by terminating the
procedure when the optimal intergroup dissimilarity exceeds that threshold
cut value. Groups that merge at high values, relative to the merger values
of the subgroups contained within them lower in the tree, are candidates
for natural clusters. Note that this may occur at several different levels,
indicating a clustering hierarchy: that is, clusters nested within clusters.

Such a dendrogram is often viewed as a graphical summary of the data
itself, rather than a description of the results of the algorithm. However,
such interpretations should be treated with caution. First, different
hierarchical methods (see below), as well as small changes in the data, can
lead to quite different dendrograms. Also, such a summary will be valid only to
the extent that the pairwise observation dissimilarities possess the
hierarchical structure produced by the algorithm. Hierarchical methods impose
hierarchical structure whether or not such structure actually exists in the
data.

FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with
average linkage to the human tumor microarray data.

The extent to which the hierarchical structure produced by a dendrogram
actually represents the data itself can be judged by the cophenetic
correlation coefficient. This is the correlation between the $N(N-1)/2$
pairwise observation dissimilarities $d_{ii'}$ input to the algorithm and their
corresponding cophenetic dissimilarities $C_{ii'}$ derived from the dendrogram.
The cophenetic dissimilarity $C_{ii'}$ between two observations $(i, i')$ is
the intergroup dissimilarity at which observations $i$ and $i'$ are first
joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure.
First, the $C_{ii'}$ over the observations must contain many ties, since only
$N - 1$ of the total $N(N-1)/2$ values can be distinct. Also these
dissimilarities obey the ultrametric inequality

$$ C_{ii'} \le \max\{C_{ik}, C_{i'k}\} \qquad (14.40) $$

for any three observations $(i, i', k)$. As a geometric example, suppose the
data were represented as points in a Euclidean coordinate system. In order
for the set of interpoint distances over the data to conform to (14.40), the
triangles formed by all triples of points must be isosceles triangles with the
unequal length no longer than the length of the two equal sides (Jain and
Dubes, 1988). Therefore it is unrealistic to expect general dissimilarities
over arbitrary data sets to closely resemble their corresponding cophenetic
dissimilarities as calculated from a dendrogram, especially if there are not
many tied values. Thus the dendrogram should be viewed mainly as a
description of the clustering structure of the data as imposed by the
particular algorithm employed.
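
To make the cophenetic dissimilarity concrete, the following sketch runs a
naive agglomeration on a small dissimilarity matrix, records the height at
which each pair of observations is first joined, and then checks the
ultrametric inequality and computes the cophenetic correlation. It is an
illustration under stated assumptions, not an efficient implementation: the
merge rule used is the average of the pairwise dissimilarities between two
groups, and all function names are hypothetical.

```python
from itertools import combinations


def cophenetic_matrix(D):
    """Agglomerate with average between-group dissimilarity; return C."""
    n = len(D)
    clusters = [[i] for i in range(n)]
    C = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            G, H = clusters[a], clusters[b]
            d = sum(D[i][j] for i in G for j in H) / (len(G) * len(H))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # pairs split between the two merged groups are first joined at height d
        for i in clusters[a]:
            for j in clusters[b]:
                C[i][j] = C[j][i] = d
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return C


def is_ultrametric(C, tol=1e-12):
    """Check C[i][j] <= max(C[i][k], C[j][k]) for all triples."""
    n = len(C)
    return all(C[i][j] <= max(C[i][k], C[j][k]) + tol
               for i in range(n) for j in range(n) for k in range(n))


def cophenetic_correlation(D, C):
    """Pearson correlation between the N(N-1)/2 entries of D and C."""
    pairs = list(combinations(range(len(D)), 2))
    x = [D[i][j] for i, j in pairs]
    y = [C[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


pts = [0, 1, 3, 7]
D = [[abs(a - b) for b in pts] for a in pts]
C = cophenetic_matrix(D)
print(is_ultrametric(C), is_ultrametric(D), cophenetic_correlation(D, C))
```

In this small example the input distances on the line do not satisfy the
ultrametric inequality, while the cophenetic dissimilarities do, and only
$N - 1 = 3$ distinct values appear in the cophenetic matrix.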

Agglomerative Clustering 

Agglomerative clustering algorithms begin with every observation
representing a singleton cluster. At each of the $N - 1$ steps the closest two
(least dissimilar) clusters are merged into a single cluster, producing one
less cluster at the next higher level. Therefore, a measure of dissimilarity
between two clusters (groups of observations) must be defined.

Let $G$ and $H$ represent two such groups. The dissimilarity $d(G, H)$
between $G$ and $H$ is computed from the set of pairwise observation
dissimilarities $d_{ii'}$ where one member of the pair $i$ is in $G$ and the
other $i'$ is in $H$. Single linkage (SL) agglomerative clustering takes the
intergroup dissimilarity to be that of the closest (least dissimilar) pair

$$ d_{SL}(G, H) = \min_{i \in G,\, i' \in H} d_{ii'}. \qquad (14.41) $$

This is also often called the nearest-neighbor technique. Complete linkage
(CL) agglomerative clustering (furthest-neighbor technique) takes the
intergroup dissimilarity to be that of the furthest (most dissimilar) pair

$$ d_{CL}(G, H) = \max_{i \in G,\, i' \in H} d_{ii'}. \qquad (14.42) $$

Group average (GA) clustering uses the average dissimilarity between the
groups

$$ d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}, \qquad (14.43) $$

where $N_G$ and $N_H$ are the respective number of observations in each group.
Although there have been many other proposals for defining intergroup
dissimilarity in the context of agglomerative clustering, the above three are
the ones most commonly used. Figure 14.13 shows examples of all three.
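
The three intergroup dissimilarities translate directly into code. The sketch
below assumes the pairwise dissimilarities are held in a matrix `D` and that
groups are given as lists of observation indices; the function names are
hypothetical.

```python
def d_single(D, G, H):
    """Single linkage (14.41): dissimilarity of the closest pair."""
    return min(D[i][j] for i in G for j in H)


def d_complete(D, G, H):
    """Complete linkage (14.42): dissimilarity of the furthest pair."""
    return max(D[i][j] for i in G for j in H)


def d_group_average(D, G, H):
    """Group average (14.43): mean dissimilarity over all cross pairs."""
    return sum(D[i][j] for i in G for j in H) / (len(G) * len(H))


# Four points on a line, split into two groups of two.
pts = [0, 1, 3, 7]
D = [[abs(a - b) for b in pts] for a in pts]
G, H = [0, 1], [2, 3]
print(d_single(D, G, H), d_complete(D, G, H), d_group_average(D, G, H))
```

For any pair of groups the single linkage value is no larger than the group
average, which in turn is no larger than the complete linkage value.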

If the data dissimilarities $\{d_{ii'}\}$ exhibit a strong clustering tendency,
with each of the clusters being compact and well separated from others, then
all three methods produce similar results. Clusters are compact if all of the
observations within them are relatively close together (small dissimilarities)
as compared with observations in different clusters. To the extent this is
not the case, results will differ.

FIGURE 14.13. Dendrograms from agglomerative hierarchical clustering of human
tumor microarray data. (Panels, left to right: Average Linkage, Complete
Linkage, Single Linkage.)

Single linkage (14.41) only requires that a single dissimilarity $d_{ii'}$,
$i \in G$ and $i' \in H$, be small for two groups $G$ and $H$ to be considered
close together, irrespective of the other observation dissimilarities between
the groups. It will therefore have a tendency to combine, at relatively low
thresholds, observations linked by a series of close intermediate
observations. This phenomenon, referred to as chaining, is often considered a
defect of the method. The clusters produced by single linkage can violate the
“compactness” property that all observations within each cluster tend to
be similar to one another, based on the supplied observation dissimilarities
$\{d_{ii'}\}$. If we define the diameter $D_G$ of a group of observations as
the largest dissimilarity among its members

$$ D_G = \max_{i \in G,\, i' \in G} d_{ii'}, \qquad (14.44) $$

then single linkage can produce clusters with very large diameters.

Complete linkage (14.42) represents the opposite extreme. Two groups
$G$ and $H$ are considered close only if all of the observations in their union
are relatively similar. It will tend to produce compact clusters with small
diameters (14.44). However, it can produce clusters that violate the
“closeness” property. That is, observations assigned to a cluster can be much
closer to members of other clusters than they are to some members of their
own cluster.

Group average clustering (14.43) represents a compromise between the
two extremes of single and complete linkage. It attempts to produce
relatively compact clusters that are relatively far apart. However, its results
depend on the numerical scale on which the observation dissimilarities
$d_{ii'}$ are measured. Applying a monotone strictly increasing transformation
$h(\cdot)$ to the $d_{ii'}$, $h_{ii'} = h(d_{ii'})$, can change the result
produced by (14.43). In contrast, (14.41) and (14.42) depend only on the
ordering of the $d_{ii'}$ and are thus invariant to such monotone
transformations. This invariance is often used as an argument in favor of
single or complete linkage over group average methods.

One can argue that group average clustering has a statistical consistency
property violated by single and complete linkage. Assume we have
attribute-value data $X^T = (X_1, \ldots, X_p)$ and that each cluster $k$ is a
random sample from some population joint density $p_k(x)$. The complete data
set is a random sample from a mixture of $K$ such densities. The group
average dissimilarity $d_{GA}(G, H)$ (14.43) is an estimate of

$$ \int\!\!\int d(x, x')\, p_G(x)\, p_H(x')\, dx\, dx', \qquad (14.45) $$

where $d(x, x')$ is the dissimilarity between points $x$ and $x'$ in the space
of attribute values. As the sample size $N$ approaches infinity $d_{GA}(G, H)$
(14.43) approaches (14.45), which is a characteristic of the relationship
between the two densities $p_G(x)$ and $p_H(x)$. For single linkage,
$d_{SL}(G, H)$ (14.41) approaches zero as $N \to \infty$ independent of
$p_G(x)$ and $p_H(x)$. For complete linkage, $d_{CL}(G, H)$ (14.42) becomes
infinite as $N \to \infty$, again independent of the two densities. Thus, it is
not clear what aspects of the population distribution are being estimated by
$d_{SL}(G, H)$ and $d_{CL}(G, H)$.

Example: Human Cancer Microarray Data (Continued) 

The left panel of Figure 14.13 shows the dendrogram resulting from average 
linkage agglomerative clustering of the samples (columns) of the microarray 
data. The middle and right panels show the result using complete and single 
linkage. Average and complete linkage gave similar results, while single 
linkage produced unbalanced groups with long thin clusters. We focus on 
the average linkage clustering. 

Like $K$-means clustering, hierarchical clustering is successful at clustering
simple cancers together. However, it has other nice features. First, by cutting
off the dendrogram at various heights, different numbers of clusters emerge,
and the sets of clusters are nested within one another. Secondly, it gives
some partial ordering information about the samples. In Figure 14.14, we
have arranged the genes (rows) and samples (columns) of the expression
matrix in orderings derived from hierarchical clustering.

Note that if we flip the orientation of the branches of a dendrogram at any
merge, the resulting dendrogram is still consistent with the series of
hierarchical clustering operations. Hence to determine an ordering of the
leaves, we must add a constraint. To produce the row ordering of Figure 14.14,
we have used the default rule in S-PLUS: at each merge, the subtree with
the tighter cluster is placed to the left (toward the bottom in the rotated
dendrogram in the figure). Individual genes are the tightest clusters
possible, and merges involving two individual genes place them in order by
their observation number. The same rule was used for the columns. Many other
rules are possible, for example ordering by a multidimensional scaling of
the genes; see Section 14.8.

The two-way rearrangement of Figure 14.14 produces an informative picture
of the genes and samples. This picture is more informative than the
randomly ordered rows and columns of Figure 1.3 of Chapter 1. Furthermore,
the dendrograms themselves are useful, as biologists can, for example,
interpret the gene clusters in terms of biological processes.


Divisive Clustering 

Divisive clustering algorithms begin with the entire data set as a single
cluster, and recursively divide one of the existing clusters into two
daughter clusters at each iteration in a top-down fashion. This approach has
not been studied nearly as extensively as agglomerative methods in the
clustering literature. It has been explored somewhat in the engineering
literature (Gersho and Gray, 1992) in the context of compression. In the
clustering setting, a potential advantage of divisive over agglomerative
methods can occur when interest is focused on partitioning the data into a
relatively small number of clusters.

The divisive paradigm can be employed by recursively applying any of
the combinatorial methods such as $K$-means (Section 14.3.6) or $K$-medoids
(Section 14.3.10), with $K = 2$, to perform the splits at each iteration.
However, such an approach would depend on the starting configuration specified
at each step. In addition, it would not necessarily produce a splitting
sequence that possesses the monotonicity property required for dendrogram
representation.

FIGURE 14.14. DNA microarray data: average linkage hierarchical clustering
has been applied independently to the rows (genes) and columns (samples),
determining the ordering of the rows and columns (see text). The colors range
from bright green (negative, under-expressed) to bright red (positive,
over-expressed).

A divisive algorithm that avoids these problems was proposed by
Macnaughton Smith et al. (1965). It begins by placing all observations in a
single cluster $G$. It then chooses that observation whose average
dissimilarity from all the other observations is largest. This observation
forms the first member of a second cluster $H$. At each successive step that
observation in $G$ whose average dissimilarity from the remaining observations
in $G$, minus its average dissimilarity from those in $H$, is largest is
transferred to $H$. This continues until the corresponding difference in
averages becomes negative. That is, there are no longer any observations in
$G$ that are, on average, closer to those in $H$. The result is a split of the
original cluster into two daughter clusters, the observations transferred to
$H$, and those remaining in $G$. These two clusters represent the second level
of the hierarchy. Each successive level is produced by applying this splitting
procedure to one of the clusters at the previous level. Kaufman and Rousseeuw
(1990) suggest choosing the cluster at each level with the largest diameter
(14.44) for splitting. An alternative would be to choose the one with the
largest average dissimilarity among its members. The recursive splitting
continues until all clusters either become singletons or all members of each
one have zero dissimilarity from one another.
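
A single split of this procedure can be sketched as follows. The sketch takes
the difference as the average dissimilarity to the rest of $G$ minus the
average dissimilarity to $H$, so that a positive value means the observation
is on average closer to $H$; it stops when the largest difference is zero or
negative, which is one way of handling ties. The function name
`split_cluster` is hypothetical.

```python
def split_cluster(D, members):
    """Split one cluster into (G, H) by successively moving observations to H."""
    G = list(members)
    if len(G) < 2:
        return G, []

    def avg(i, group):
        # average dissimilarity from i to the other members of group
        others = [j for j in group if j != i]
        return sum(D[i][j] for j in others) / len(others) if others else 0.0

    # Seed H with the observation farthest, on average, from all the others.
    seed = max(G, key=lambda i: avg(i, G))
    G.remove(seed)
    H = [seed]
    while len(G) > 1:
        # positive difference: i is closer on average to H than to the rest of G
        diffs = {i: avg(i, G) - avg(i, H) for i in G}
        i_best = max(diffs, key=diffs.get)
        if diffs[i_best] <= 0:
            break
        G.remove(i_best)
        H.append(i_best)
    return G, H


pts = [0, 1, 2, 10, 11]
D = [[abs(a - b) for b in pts] for a in pts]
G, H = split_cluster(D, range(len(pts)))
print(sorted(G), sorted(H))
```

Applying `split_cluster` repeatedly to a chosen cluster at each level (for
instance the one with the largest diameter) would produce the full hierarchy.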


14.4 Self-Organizing Maps 

This method can be viewed as a constrained version of $K$-means clustering,
in which the prototypes are encouraged to lie in a one- or two-dimensional
manifold in the feature space. The resulting manifold is also referred to
as a constrained topological map, since the original high-dimensional
observations can be mapped down onto the two-dimensional coordinate system.
The original SOM algorithm was online (observations are processed one at
a time), and later a batch version was proposed. The technique also bears
a close relationship to principal curves and surfaces, which are discussed in
the next section.

We consider a SOM with a two-dimensional rectangular grid of $K$ prototypes
$m_j \in \mathbb{R}^p$ (other choices, such as hexagonal grids, can also be
used). Each of the $K$ prototypes is parametrized with respect to an integer
coordinate pair $\ell_j \in Q_1 \times Q_2$. Here $Q_1 = \{1, 2, \ldots, q_1\}$,
similarly $Q_2$, and $K = q_1 \cdot q_2$. The $m_j$ are initialized, for
example, to lie in the two-dimensional principal component plane of the data
(next section). We can think of the prototypes as “buttons,” “sewn” on the
principal component plane in a regular pattern. The SOM procedure tries to
bend the plane so that the buttons approximate the data points as well as
possible. Once the model is fit, the observations can be mapped down onto the
two-dimensional grid.

The observations $x_i$ are processed one at a time. We find the closest
prototype $m_j$ to $x_i$ in Euclidean distance in $\mathbb{R}^p$, and then for
all neighbors $m_k$ of $m_j$, move $m_k$ toward $x_i$ via the update

$$ m_k \leftarrow m_k + \alpha (x_i - m_k). \qquad (14.46) $$

The “neighbors” of $m_j$ are defined to be all $m_k$ such that the distance
between $\ell_j$ and $\ell_k$ is small. The simplest approach uses Euclidean
distance, and “small” is determined by a threshold $r$. This neighborhood
always includes the closest prototype $m_j$ itself.

Notice that distance is defined in the space $Q_1 \times Q_2$ of integer
topological coordinates of the prototypes, rather than in the feature space
$\mathbb{R}^p$. The effect of the update (14.46) is to move the prototypes
closer to the data, but also to maintain a smooth two-dimensional spatial
relationship between the prototypes.

The performance of the SOM algorithm depends on the learning rate
$\alpha$ and the distance threshold $r$. Typically $\alpha$ is decreased from
say 1.0 to 0.0 over a few thousand iterations (one per observation). Similarly
$r$ is decreased linearly from starting value $R$ to 1 over a few thousand
iterations. We illustrate a method for choosing $R$ in the example below.
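
The online procedure with update (14.46) and linearly decreasing $\alpha$ and
$r$ can be sketched as below. This is a minimal illustration rather than a
reference implementation: the prototypes are initialized at randomly chosen
observations instead of on the principal component plane, the observations
are visited in random order within each pass, and the function name
`som_online` and its default arguments are hypothetical.

```python
import random


def som_online(X, q1, q2, n_passes=40, R=2.0, alpha0=1.0, seed=0):
    """Online SOM on a q1 x q2 rectangular grid with a rectangular neighborhood."""
    rng = random.Random(seed)
    # integer grid coordinates l_j of the prototypes
    grid = [(a, b) for a in range(1, q1 + 1) for b in range(1, q2 + 1)]
    # Assumption: start each prototype at a randomly chosen observation.
    protos = [list(rng.choice(X)) for _ in grid]
    total = n_passes * len(X)
    t = 0
    for _ in range(n_passes):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            frac = t / total
            alpha = alpha0 * (1.0 - frac)   # learning rate decreases linearly to 0
            r = R + (1.0 - R) * frac        # threshold decreases linearly from R to 1
            x = X[i]
            # closest prototype, measured in the feature space
            j = min(range(len(protos)),
                    key=lambda k: sum((u - v) ** 2 for u, v in zip(x, protos[k])))
            for k, (a, b) in enumerate(grid):
                # neighbors are defined on the integer grid, not in feature space
                dist = ((a - grid[j][0]) ** 2 + (b - grid[j][1]) ** 2) ** 0.5
                if dist <= r:
                    protos[k] = [m + alpha * (xc - m)
                                 for m, xc in zip(protos[k], x)]
            t += 1
    return grid, protos


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]]
grid, protos = som_online(X, q1=2, q2=2, n_passes=10, R=1.0)
print(protos)
```

Since every update is a convex combination of a prototype and an observation,
the prototypes always remain inside the convex hull of the data.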

We have described the simplest version of the SOM. More sophisticated
versions modify the update step according to distance:

$$ m_k \leftarrow m_k + \alpha h(\|\ell_j - \ell_k\|)(x_i - m_k), \qquad (14.47) $$

where the neighborhood function $h$ gives more weight to prototypes $m_k$ with
indices $\ell_k$ closer to $\ell_j$ than to those further away.

If we take the distance $r$ small enough so that each neighborhood contains
only one point, then the spatial connection between prototypes is lost. In
that case one can show that the SOM algorithm is an online version of
$K$-means clustering, and eventually stabilizes at one of the local minima
found by $K$-means. Since the SOM is a constrained version of $K$-means
clustering, it is important to check whether the constraint is reasonable
in any given problem. One can do this by computing the reconstruction
error $\|x - m_j\|^2$, summed over observations, for both methods. This will
necessarily be smaller for $K$-means, but should not be much smaller if the
SOM is a reasonable approximation.

As an illustrative example, we generated 90 data points in three dimensions,
near the surface of a half sphere of radius 1. The points fell into three
clusters (red, green, and blue) located near (0,1,0), (0,0,1) and
(1,0,0). The data are shown in Figure 14.15.

By design, the red cluster was much tighter than the green or blue ones.
(Full details of the data generation are given in Exercise 14.5.) A 5 x 5 grid
of prototypes was used, with initial grid size $R = 2$; this meant that about
a third of the prototypes were initially in each neighborhood. We did a
total of 40 passes through the dataset of 90 observations, and let $r$ and
$\alpha$ decrease linearly over the 3600 iterations.

In Figure 14.16 the prototypes are indicated by circles, and the points
that project to each prototype are plotted randomly within the corresponding
circle. The left panel shows the initial configuration, while the right
panel shows the final one. The algorithm has succeeded in separating the
clusters; however, the separation of the red cluster indicates that the
manifold has folded back on itself (see Figure 14.17). Since the distances in
the two-dimensional display are not used, there is little indication in the SOM
projection that the red cluster is tighter than the others.

FIGURE 14.15. Simulated data in three classes, near the surface of a
half-sphere.

FIGURE 14.16. Self-organizing map applied to half-sphere data example. Left
panel is the initial configuration, right panel the final one. The 5 x 5 grid
of prototypes is indicated by circles, and the points that project to each
prototype are plotted randomly within the corresponding circle.

FIGURE 14.17. Wiremesh representation of the fitted SOM model in
$\mathbb{R}^3$. The lines represent the horizontal and vertical edges of the
topological lattice. The double lines indicate that the surface was folded
diagonally back on itself in order to model the red points. The cluster members
have been jittered to indicate their color, and the purple points are the node
centers.


Figure 14.18 shows the reconstruction error, equal to the total sum of
squares of each data point around its prototype. For comparison we carried
out a $K$-means clustering with 25 centroids, and indicate its reconstruction
error by the horizontal line on the graph. We see that the SOM significantly
decreases the error, nearly to the level of the $K$-means solution. This
provides evidence that the two-dimensional constraint used by the SOM is
reasonable for this particular dataset.

In the batch version of the SOM, we update each $m_j$ via

$$ m_j = \frac{\sum w_k x_k}{\sum w_k}. \qquad (14.48) $$

The sum is over points $x_k$ that mapped (i.e., were closest to) neighbors
$m_k$ of $m_j$. The weight function may be rectangular, that is, equal to 1 for
the neighbors of $m_k$, or may decrease smoothly with distance
$\|\ell_k - \ell_j\|$ as before. If the neighborhood size is chosen small
enough so that it consists only of $m_k$, with rectangular weights, this
reduces to the $K$-means clustering procedure described earlier. It can also
be thought of as a discrete version of principal curves and surfaces,
described in Section 14.5.
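
One pass of the batch update (14.48) with rectangular weights might look as
follows. The sketch assumes a grid of integer coordinate pairs and a
threshold `r` on grid distance, as in the online version; the function name
`som_batch_step` is hypothetical. With `r = 0` each neighborhood contains only
the prototype itself and the step coincides with a $K$-means update.

```python
def som_batch_step(X, grid, protos, r):
    """One batch SOM update: weighted mean of points mapped to neighboring nodes."""
    p = len(X[0])

    def closest(x):
        # index of the prototype nearest to x in the feature space
        return min(range(len(protos)),
                   key=lambda k: sum((u - v) ** 2 for u, v in zip(x, protos[k])))

    assign = [closest(x) for x in X]
    updated = []
    for j, (a, b) in enumerate(grid):
        num, den = [0.0] * p, 0.0
        for x, k in zip(X, assign):
            dist = ((grid[k][0] - a) ** 2 + (grid[k][1] - b) ** 2) ** 0.5
            w = 1.0 if dist <= r else 0.0      # rectangular weight function
            num = [n + w * xc for n, xc in zip(num, x)]
            den += w
        # keep the old prototype if no point mapped to its neighborhood
        updated.append([n / den for n in num] if den > 0 else list(protos[j]))
    return updated


X = [[0.0], [1.0], [10.0], [11.0]]
grid = [(1, 1), (1, 2)]
protos = [[0.0], [10.0]]
print(som_batch_step(X, grid, protos, r=0))
print(som_batch_step(X, grid, protos, r=1))
```

A smoothly decreasing weight function could be substituted for the rectangular
one by replacing the line that sets `w`.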

FIGURE 14.18. Half-sphere data: reconstruction error for the SOM as a
function of iteration. Error for $K$-means clustering is indicated by the
horizontal line.


Example: Document Organization and Retrieval 

Document retrieval has gained importance with the rapid development of
the Internet and the Web, and SOMs have proved to be useful for organizing
and indexing large corpora. This example is taken from the WEBSOM
homepage http://websom.hut.fi/ (Kohonen et al., 2000). Figure 14.19
represents a SOM fit to 12,088 newsgroup comp.ai.neural-nets articles. The
labels are generated automatically by the WEBSOM software and provide
a guide as to the typical content of a node.

In applications such as this, the documents have to be preprocessed in
order to create a feature vector. A term-document matrix is created, where 
each row represents a single document. The entries in each row are the 
relative frequency of each of a predefined set of terms. These terms could 
be a large set of dictionary entries (50,000 words), or an even larger set 
of bigrams (word pairs), or subsets of these. These matrices are typically 
very sparse, and so often some preprocessing is done to reduce the number 
of features (columns). Sometimes the SVD (next section) is used to reduce 
the matrix; Kohonen et al. (2000) use a randomized variant thereof. These 
reduced vectors are then the input to the SOM. 
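
As an illustration of the first step, the following sketch builds a small
term-document matrix of relative frequencies. It makes simplifying
assumptions that are not part of the WEBSOM system: documents are tokenized
by lowercasing and splitting on whitespace, and frequencies are taken
relative to the number of vocabulary terms found in the document. The
function name `term_document_matrix` is hypothetical.

```python
from collections import Counter


def term_document_matrix(docs, vocabulary):
    """Return one row per document with relative frequencies of vocabulary terms."""
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for text in docs:
        tokens = text.lower().split()
        counts = Counter(t for t in tokens if t in index)
        total = sum(counts.values())
        row = [0.0] * len(vocabulary)
        for term, c in counts.items():
            row[index[term]] = c / total
        rows.append(row)
    return rows


docs = ["neural nets learn", "nets nets train"]
vocabulary = ["neural", "nets", "train"]
for row in term_document_matrix(docs, vocabulary):
    print(row)
```

With a realistic vocabulary most entries of each row are zero, which is why a
sparse representation and a dimension reduction step are used in practice.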

FIGURE 14.19. Heatmap representation of the SOM model fit to a corpus
of 12,088 newsgroup comp.ai.neural-nets contributions (courtesy WEBSOM
homepage). The lighter areas indicate higher-density areas. Populated nodes are
automatically labeled according to typical content.

FIGURE 14.20. The first linear principal component of a set of data. The line
minimizes the total squared distance from each point to its orthogonal projection
onto the line.

In this application the authors have developed a “zoom” feature, which 
allows one to interact with the map in order to get more detail. The final 
level of zooming retrieves the actual news articles, which can then be read. 


14.5 Principal Components, Curves and Surfaces 

Principal components are discussed in Section 3.4.1, where they shed light
on the shrinkage mechanism of ridge regression. Principal components are
a sequence of projections of the data, mutually uncorrelated and ordered
in variance. In the next section we present principal components as linear
manifolds approximating a set of $N$ points $x_i \in \mathbb{R}^p$. We then
present some nonlinear generalizations in Section 14.5.2. Other recent
proposals for nonlinear approximating manifolds are discussed in Section 14.9.


14.5.1 Principal Components

The principal components of a set of data in $\mathbb{R}^p$ provide a sequence
of best linear approximations to that data, of all ranks $q \le p$.

Denote the observations by $x_1, x_2, \ldots, x_N$, and consider the rank-$q$
linear model for representing them

$$ f(\lambda) = \mu + \mathbf{V}_q \lambda, \qquad (14.49) $$

where $\mu$ is a location vector in $\mathbb{R}^p$, $\mathbf{V}_q$ is a
$p \times q$ matrix with $q$ orthogonal unit vectors as columns, and $\lambda$
is a $q$ vector of parameters. This is the parametric representation of an
affine hyperplane of rank $q$. Figures 14.20 and 14.21 illustrate for $q = 1$
and $q = 2$, respectively. Fitting such a model to the data by least squares
amounts to minimizing the reconstruction error

$$ \min_{\mu, \{\lambda_i\}, \mathbf{V}_q} \sum_{i=1}^{N} \|x_i - \mu - \mathbf{V}_q \lambda_i\|^2. \qquad (14.50) $$

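We can partially optimize for $\mu$ and the $\lambda_i$ (Exercise 14.7) to
obtain

$$ \hat{\mu} = \bar{x}, \qquad (14.51) $$

$$ \hat{\lambda}_i = \mathbf{V}_q^T (x_i - \bar{x}). \qquad (14.52) $$

This leaves us to find the orthogonal matrix $\mathbf{V}_q$:

$$ \min_{\mathbf{V}_q} \sum_{i=1}^{N} \|(x_i - \bar{x}) - \mathbf{V}_q \mathbf{V}_q^T (x_i - \bar{x})\|^2. \qquad (14.53) $$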

For convenience we assume that $\bar{x} = 0$ (otherwise we simply replace the
observations by their centered versions $\tilde{x}_i = x_i - \bar{x}$). The
$p \times p$ matrix $\mathbf{H}_q = \mathbf{V}_q \mathbf{V}_q^T$ is a
projection matrix, and maps each point $x_i$ onto its rank-$q$ reconstruction
$\mathbf{H}_q x_i$, the orthogonal projection of $x_i$ onto the subspace
spanned by the columns of $\mathbf{V}_q$. The solution can be expressed as
follows. Stack the (centered) observations into the rows of an $N \times p$
matrix $\mathbf{X}$. We construct the singular value decomposition of
$\mathbf{X}$:

$$ \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T. \qquad (14.54) $$

This is a standard decomposition in numerical analysis, and many algo¬ 
rithms exist for its computation (Golub and Van Loan, 1983, for example). 
Here U is an N x p orthogonal matrix (U T U = I p ) whose columns Uj are 
called the left singular vectors; V is a pxp orthogonal matrix (V T V = I p ) 
with columns Vj called the right singular vectors, and D is a p x p diagonal 
matrix, with diagonal elements d\ > c ?2 > • • • > d p > 0 known as the sin¬ 
gular values. For each rank q, the solution V 9 to (14.53) consists of the first 
q columns of V. The columns of UD are called the principal components 
of X (see Section 3.5.1). The N optimal Aj in (14.52) are given by the first 
q principal components (the N rows of the N x q matrix U g D g ). 
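
The SVD recipe above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `pca_svd` and the synthetic data are hypothetical, not part of the text, and the code simply follows (14.51), (14.52) and (14.54).

```python
import numpy as np

def pca_svd(X, q):
    """Rank-q principal component fit of the rows of X via the SVD (14.54)."""
    xbar = X.mean(axis=0)                              # mu-hat in (14.51)
    Xc = X - xbar                                      # centered observations
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U diag(d) Vt
    Vq = Vt[:q].T                                      # first q right singular vectors (p x q)
    scores = U[:, :q] * d[:q]                          # U_q D_q: the lambda-hat_i as rows
    recon = xbar + scores @ Vq.T                       # f(lambda-hat_i) = xbar + V_q lambda-hat_i
    return xbar, Vq, scores, recon, d

# Synthetic correlated data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
xbar, Vq, scores, recon, d = pca_svd(X, q=2)
err = np.sum((X - recon) ** 2)                         # reconstruction error (14.50)
```

In this sketch the reconstruction error equals the sum of the squared discarded singular values, which is one way to see why the leading right singular vectors solve (14.53).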

The one-dimensional principal component line in $\mathbb{R}^2$ is illustrated in
Figure 14.20. For each data point $x_i$, there is a closest point on the line, given
by $u_{i1}d_1v_1$. Here $v_1$ is the direction of the line and $\hat\lambda_i = u_{i1}d_1$ measures
distance along the line from the origin. Similarly Figure 14.21 shows the
two-dimensional principal component surface fit to the half-sphere data
(left panel). The right panel shows the projection of the data onto the
first two principal components. This projection was the basis for the initial
configuration for the SOM method shown earlier. The procedure is quite
successful at separating the clusters. Since the half-sphere is nonlinear, a
nonlinear projection will do a better job, and this is the topic of the next
section.

FIGURE 14.21. The best rank-two linear approximation to the half-sphere data.
The right panel shows the projected points with coordinates given by $\mathbf{U}_2\mathbf{D}_2$, the
first two principal components of the data.

Principal components have many other nice properties; for example, the
linear combination $\mathbf{X}v_1$ has the highest variance among all linear
combinations of the features; $\mathbf{X}v_2$ has the highest variance among all linear
combinations satisfying $v_2$ orthogonal to $v_1$, and so on.

Example: Handwritten Digits 

Principal components are a useful tool for dimension reduction and
compression. We illustrate this feature on the handwritten digits data described
in Chapter 1. Figure 14.22 shows a sample of 130 handwritten 3’s, each a
digitized $16 \times 16$ grayscale image, from a total of 658 such 3’s. We see
considerable variation in writing styles, character thickness and orientation.
We consider these images as points $x_i$ in $\mathbb{R}^{256}$, and compute their
principal components via the SVD (14.54).

FIGURE 14.22. A sample of 130 handwritten 3’s shows a variety of writing
styles.

Figure 14.23 shows the first two principal components of these data. For
each of these first two principal components $u_{i1}$ and $u_{i2}$, we computed the
5%, 25%, 50%, 75% and 95% quantile points, and used them to define
the rectangular grid superimposed on the plot. The circled points indicate
those images close to the vertices of the grid, where the distance measure
focuses mainly on these projected coordinates, but gives some weight to the
components in the orthogonal subspace. The right plot shows the images
corresponding to these circled points. This allows us to visualize the nature
of the first two principal components. We see that $v_1$ (horizontal movement)
mainly accounts for the lengthening of the lower tail of the three,
while $v_2$ (vertical movement) accounts for character thickness. In terms of
the parametrized model (14.49), this two-component model has the form

    $f(\lambda) = \bar{x} + \lambda_1v_1 + \lambda_2v_2$.    (14.55)


Here we have displayed the first two principal component directions, $v_1$
and $v_2$, as images. Although there are a possible 256 principal components,
approximately 50 account for 90% of the variation in the threes, 12 account
for 63%. Figure 14.24 compares the singular values to those obtained
for equivalent uncorrelated data, obtained by randomly scrambling each
column of $\mathbf{X}$. The pixels in a digitized image are inherently correlated,
and since these are all the same digit the correlations are even stronger.
A relatively small subset of the principal components serves as excellent
lower-dimensional features for representing the high-dimensional data.

FIGURE 14.23. (Left panel:) the first two principal components of the
handwritten threes. The circled points are the closest projected images to the vertices
of a grid, defined by the marginal quantiles of the principal components. (Right
panel:) The images corresponding to the circled points. These show the nature of
the first two principal components.

FIGURE 14.24. The 256 singular values for the digitized threes, compared to
those for a randomized version of the data (each column of $\mathbf{X}$ was scrambled).
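
The comparison between the singular values of the data and those of a column-scrambled copy can be sketched as follows. The digit images are not reproduced here, so the example uses a synthetic low-rank-plus-noise matrix as a stand-in (an assumption); the helper names `singular_values` and `scramble_columns` are hypothetical.

```python
import numpy as np

def singular_values(X):
    """Singular values of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, compute_uv=False)

def scramble_columns(X, rng):
    """Independently permute the entries within each column of X."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

rng = np.random.default_rng(1)
# Synthetic stand-in for correlated pixels: rank-3 signal plus a little noise.
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 40)) \
    + 0.1 * rng.normal(size=(300, 40))

d = singular_values(X)
d_scr = singular_values(scramble_columns(X, rng))

# Cumulative fraction of variance explained by the leading components.
frac = np.cumsum(d ** 2) / np.sum(d ** 2)
n90 = int(np.searchsorted(frac, 0.90) + 1)   # components needed to reach 90%
```

Scrambling preserves each column's mean and variance, so the total variance is unchanged; only the correlation between columns is destroyed, which flattens the singular-value profile.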


Example: Procrustes Transformations and Shape Averaging 



FIGURE 14.25. (Left panel:) Two different digitized handwritten S’s, each
represented by 96 corresponding points in $\mathbb{R}^2$. The green S has been deliberately
rotated and translated for visual effect. (Right panel:) A Procrustes transformation
applies a translation and rotation to best match up the two sets of points.


Figure 14.25 represents two sets of points, the orange and green, in the
same plot. In this instance these points represent two digitized versions
of a handwritten S, extracted from the signature of a subject “Suresh.”
Figure 14.26 shows the entire signatures from which these were extracted
(third and fourth panels). The signatures are recorded dynamically using
touch-screen devices, familiar sights in modern supermarkets. There are
$N = 96$ points representing each S, which we denote by the $N \times 2$ matrices
$\mathbf{X}_1$ and $\mathbf{X}_2$. There is a correspondence between the points: the $i$th rows
of $\mathbf{X}_1$ and $\mathbf{X}_2$ are meant to represent the same positions along the two S’s.
In the language of morphometrics, these points represent landmarks on
the two objects. How one finds such corresponding landmarks is in general
difficult and subject specific. In this particular case we used dynamic time
warping of the speed signal along each signature (Hastie et al., 1992), but
will not go into details here.

In the right panel we have applied a translation and rotation to the green
points so as best to match the orange, a so-called Procrustes³ transformation
(Mardia et al., 1979, for example).

Consider the problem

    $\min_{\mu, \mathbf{R}} \|\mathbf{X}_2 - (\mathbf{X}_1\mathbf{R} + \mathbf{1}\mu^T)\|_F$,    (14.56)

with $\mathbf{X}_1$ and $\mathbf{X}_2$ both $N \times p$ matrices of corresponding points, $\mathbf{R}$ an
orthonormal $p \times p$ matrix⁴, and $\mu$ a $p$-vector of location coordinates. Here
$\|\mathbf{X}\|_F^2 = \mathrm{trace}(\mathbf{X}^T\mathbf{X})$ is the squared Frobenius matrix norm.

³ Procrustes was an African bandit in Greek mythology, who stretched or squashed
his visitors to fit his iron bed (eventually killing them).

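Let $\bar{x}_1$ and $\bar{x}_2$ be the column mean vectors of the matrices, and $\tilde{\mathbf{X}}_1$ and
$\tilde{\mathbf{X}}_2$ be the versions of these matrices with the means removed. Consider
the SVD $\tilde{\mathbf{X}}_1^T\tilde{\mathbf{X}}_2 = \mathbf{U}\mathbf{D}\mathbf{V}^T$. Then the solution to (14.56) is given by
(Exercise 14.8)

    $\hat{\mathbf{R}} = \mathbf{U}\mathbf{V}^T$,
    $\hat\mu = \bar{x}_2 - \hat{\mathbf{R}}^T\bar{x}_1$,    (14.57)

and the minimal distance is referred to as the Procrustes distance. (The
transpose on $\hat{\mathbf{R}}$ arises because the points are stored as rows in (14.56).) From
the form of the solution, we can center each matrix at its column centroid,
and then ignore location completely. Hereafter we assume this is the case.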

The Procrustes distance with scaling solves a slightly more general
problem,

    $\min_{\beta, \mathbf{R}} \|\mathbf{X}_2 - \beta\mathbf{X}_1\mathbf{R}\|_F$,    (14.58)

where $\beta > 0$ is a positive scalar. The solution for $\mathbf{R}$ is as before, with
$\hat\beta = \mathrm{trace}(\mathbf{D})/\|\mathbf{X}_1\|_F^2$.
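
A minimal NumPy sketch of the Procrustes solution, with optional scaling, is given below. The function name `procrustes`, the `scale` flag and the synthetic test shapes are assumptions made for illustration. Because the points are stored as rows, the translation is computed as the mean of the second shape minus the transformed mean of the first.

```python
import numpy as np

def procrustes(X1, X2, scale=False):
    """Rotate (and optionally scale) and translate X1 to best match X2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    X1c, X2c = X1 - xbar1, X2 - xbar2                 # remove column means
    U, d, Vt = np.linalg.svd(X1c.T @ X2c)             # SVD of X1c^T X2c = U D V^T
    R = U @ Vt                                        # R-hat = U V^T
    # beta-hat = trace(D) / ||X1c||_F^2 when scaling is allowed, else 1.
    beta = d.sum() / np.sum(X1c ** 2) if scale else 1.0
    # Points are rows, so a row x maps to beta * x @ R + mu.
    mu = xbar2 - beta * (xbar1 @ R)
    return R, mu, beta

rng = np.random.default_rng(2)
X1 = rng.normal(size=(96, 2))                         # hypothetical landmark matrix
theta = np.pi / 6
R0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
mu0 = np.array([3.0, -1.0])
X2 = 1.5 * X1 @ R0 + mu0                              # rotated, scaled, shifted copy
R, mu, beta = procrustes(X1, X2, scale=True)
dist = np.linalg.norm(X2 - (beta * X1 @ R + mu))      # Frobenius distance after matching
```

Since `X2` is an exact transformed copy of `X1` in this toy setup, the recovered rotation, scale and shift should reproduce the ones used to generate it.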

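Related to Procrustes distance is the Procrustes average of a collection
of $L$ shapes, which solves the problem

    $\min_{\{\mathbf{R}_\ell\}_1^L, \mathbf{M}} \sum_{\ell=1}^L \|\mathbf{X}_\ell\mathbf{R}_\ell - \mathbf{M}\|_F^2$;    (14.59)

that is, find the shape $\mathbf{M}$ closest in average squared Procrustes distance to
all the shapes. This is solved by a simple alternating algorithm:

0. Initialize $\mathbf{M} = \mathbf{X}_1$ (for example).

1. Solve the $L$ Procrustes rotation problems with $\mathbf{M}$ fixed, yielding
   $\mathbf{X}'_\ell \leftarrow \mathbf{X}_\ell\hat{\mathbf{R}}_\ell$.

2. Let $\mathbf{M} \leftarrow \frac{1}{L}\sum_{\ell=1}^L \mathbf{X}'_\ell$.

Steps 1 and 2 are repeated until the criterion (14.59) converges.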

Figure 14.26 shows a simple example with three shapes. Note that we can 
only expect a solution up to a rotation; alternatively, we can impose a 
constraint, such as that M be upper-triangular, to force uniqueness. One 
can easily incorporate scaling in the definition (14.59); see Exercise 14.9. 
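
The alternating algorithm for the Procrustes average can be sketched as below. The names `rotation_to` and `procrustes_average`, the iteration cap and the tolerance are hypothetical choices; the example shapes are rotated copies of one synthetic shape, so the criterion should drop to essentially zero.

```python
import numpy as np

def rotation_to(X, M):
    """Orthogonal R minimizing ||X R - M||_F (X and M assumed centered)."""
    U, _, Vt = np.linalg.svd(X.T @ M)
    return U @ Vt

def procrustes_average(shapes, n_iter=100, tol=1e-12):
    """Alternate rotation (step 1) and averaging (step 2) for (14.59)."""
    shapes = [X - X.mean(axis=0) for X in shapes]              # ignore location
    M = shapes[0].copy()                                        # step 0
    prev = np.inf
    for _ in range(n_iter):
        rotated = [X @ rotation_to(X, M) for X in shapes]      # step 1
        M = np.mean(rotated, axis=0)                            # step 2
        crit = sum(np.sum((Xr - M) ** 2) for Xr in rotated)    # criterion (14.59)
        if prev - crit < tol:
            break
        prev = crit
    return M, crit

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

rng = np.random.default_rng(3)
base = rng.normal(size=(96, 2))                                 # synthetic shape
shapes = [base @ rot(t) for t in (0.0, 0.7, -1.2)]
M, crit = procrustes_average(shapes)
```

As the text notes, the average is determined only up to a rotation; here it comes out aligned with the first shape because that shape is used for initialization.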

⁴ To simplify matters, we consider only orthogonal matrices which include reflections
as well as rotations [the $O(p)$ group]; although reflections are unlikely here, these methods
can be restricted further to allow only rotations [$SO(p)$ group].

FIGURE 14.26. The Procrustes average of three versions of the leading S in
Suresh’s signatures. The left panel shows the preshape average, with each of the
shapes in preshape space superimposed. The right three panels map the preshape
$\mathbf{M}$ separately to match each of the original S’s.

Most generally we can define the affine-invariant average of a set of
shapes via

    $\min_{\{\mathbf{A}_\ell\}_1^L, \mathbf{M}} \sum_{\ell=1}^L \|\mathbf{X}_\ell\mathbf{A}_\ell - \mathbf{M}\|_F^2$,    (14.60)

where the $\mathbf{A}_\ell$ are any $p \times p$ nonsingular matrices. Here we require a
standardization, such as $\mathbf{M}^T\mathbf{M} = \mathbf{I}$, to avoid a trivial solution. The solution is
attractive, and can be computed without iteration (Exercise 14.10):

1. Let $\mathbf{H}_\ell = \mathbf{X}_\ell(\mathbf{X}_\ell^T\mathbf{X}_\ell)^{-1}\mathbf{X}_\ell^T$ be the rank-$p$ projection matrix defined
   by $\mathbf{X}_\ell$.

2. $\mathbf{M}$ is the $N \times p$ matrix formed from the $p$ largest eigenvectors of
   $\bar{\mathbf{H}} = \frac{1}{L}\sum_{\ell=1}^L \mathbf{H}_\ell$.
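
The two-step, non-iterative solution for the affine-invariant average can be sketched as follows. The function name `affine_invariant_average` and the example matrices are hypothetical; the projection onto the column space of each shape is formed from a QR factorization rather than an explicit inverse, which assumes each shape matrix has full column rank.

```python
import numpy as np

def affine_invariant_average(shapes):
    """Top-p eigenvectors of the average projection matrix H-bar."""
    N, p = shapes[0].shape
    H_bar = np.zeros((N, N))
    for X in shapes:
        Q, _ = np.linalg.qr(X)             # orthonormal basis for the column space of X
        H_bar += Q @ Q.T                   # equals X (X^T X)^{-1} X^T
    H_bar /= len(shapes)
    evals, evecs = np.linalg.eigh(H_bar)   # eigenvalues in ascending order
    M = evecs[:, ::-1][:, :p]              # p largest eigenvectors, so M^T M = I
    return M, H_bar

rng = np.random.default_rng(4)
base = rng.normal(size=(96, 2))
base -= base.mean(axis=0)
# Three affine copies of one synthetic shape (fixed nonsingular matrices).
mats = [np.eye(2),
        np.array([[2.0, 1.0], [0.0, 1.0]]),
        np.array([[0.5, -1.0], [1.0, 0.3]])]
shapes = [base @ A for A in mats]
M, H_bar = affine_invariant_average(shapes)
# Affine map taking the first shape onto the average.
A_fit = np.linalg.lstsq(shapes[0], M, rcond=None)[0]
```

Because the three example shapes share one column space, each can be mapped exactly onto `M` by a suitable nonsingular matrix, so the criterion (14.60) is zero in this toy case.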

14.5.2 Principal Curves and Surfaces

Principal curves generalize the principal component line, providing a smooth
one-dimensional curved approximation to a set of data points in $\mathbb{R}^p$. A
principal surface is more general, providing a curved manifold approximation
of dimension 2 or more.

We will first define principal curves for random variables $X \in \mathbb{R}^p$, and
then move to the finite data case. Let $f(\lambda)$ be a parameterized smooth
curve in $\mathbb{R}^p$. Hence $f(\lambda)$ is a vector function with $p$ coordinates, each a
smooth function of the single parameter $\lambda$. The parameter $\lambda$ can be chosen,
for example, to be arc-length along the curve from some fixed origin. For
each data value $x$, let $\lambda_f(x)$ define the closest point on the curve to $x$. Then
$f(\lambda)$ is called a principal curve for the distribution of the random vector
$X$ if

    $f(\lambda) = \mathrm{E}(X \mid \lambda_f(X) = \lambda)$.    (14.61)

This says $f(\lambda)$ is the average of all data points that project to it, that is, the
points for which it is “responsible.” This is also known as a self-consistency
property. Although in practice, continuous multivariate distributions have
infinitely many principal curves (Duchamp and Stuetzle, 1996), we are
interested mainly in the smooth ones. A principal curve is illustrated in
Figure 14.27.

FIGURE 14.27. The principal curve of a set of data. Each point on the curve
is the average of all data points that project there.

Principal points are an interesting related concept. Consider a set of $k$
prototypes and for each point $x$ in the support of a distribution, identify
the closest prototype, that is, the prototype that is responsible for it. This
induces a partition of the feature space into so-called Voronoi regions. The
set of $k$ points that minimize the expected distance from $X$ to its prototype
are called the principal points of the distribution. Each principal point is
self-consistent, in that it equals the mean of $X$ in its Voronoi region. For
example, with $k = 1$, the principal point of a circular normal distribution is
the mean vector; with $k = 2$ they are a pair of points symmetrically placed
on a ray through the mean vector. Principal points are the distributional
analogs of centroids found by $K$-means clustering. Principal curves can be
viewed as $k = \infty$ principal points, but constrained to lie on a smooth curve,
in a similar way that a SOM constrains $K$-means cluster centers to fall on
a smooth manifold.

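To find a principal curve $f(\lambda)$ of a distribution, we consider its coordinate
functions $f(\lambda) = [f_1(\lambda), f_2(\lambda), \ldots, f_p(\lambda)]$ and let $X^T = (X_1, X_2, \ldots, X_p)$.
Consider the following alternating steps:

    (a) $\hat{f}_j(\lambda) \leftarrow \mathrm{E}(X_j \mid \lambda(X) = \lambda)$;  $j = 1, 2, \ldots, p$,
                                                                    (14.62)
    (b) $\hat\lambda_f(x) \leftarrow \mathrm{argmin}_{\lambda'} \|x - \hat{f}(\lambda')\|^2$.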

The first equation fixes $\lambda$ and enforces the self-consistency requirement
(14.61). The second equation fixes the curve and finds the closest point on
the curve to each data point. With finite data, the principal curve algorithm
starts with the linear principal component, and iterates the two steps in
(14.62) until convergence. A scatterplot smoother is used to estimate the
conditional expectations in step (a) by smoothing each $X_j$ as a function of
the arc-length $\hat\lambda(X)$, and the projection in (b) is done for each of the
observed data points. Proving convergence in general is difficult, but one can
show that if a linear least squares fit is used for the scatterplot smoothing,
then the procedure converges to the first linear principal component, and
is equivalent to the power method for finding the largest eigenvector of a
matrix.

FIGURE 14.28. Principal surface fit to half-sphere data. (Left panel:) fitted
two-dimensional surface. (Right panel:) projections of data points onto the
surface, resulting in coordinates $\hat\lambda_1, \hat\lambda_2$.
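
The remark about the linear smoother can be checked numerically. The sketch below is not a general principal-curve implementation: it replaces the scatterplot smoother in step (a) by a least-squares line through the origin (on centered data), and the function name `linear_principal_curve`, the starting parameterization and the synthetic data are assumptions.

```python
import numpy as np

def linear_principal_curve(X, n_iter=200):
    """Principal-curve iteration with a least-squares line as the smoother."""
    Xc = X - X.mean(axis=0)
    lam = Xc[:, 0].copy()                  # arbitrary starting parameterization
    for _ in range(n_iter):
        # Step (a): regress each coordinate on lambda (line through the origin).
        a = Xc.T @ lam / (lam @ lam)
        # Step (b): project each point onto the fitted line f(lambda) = a * lambda.
        lam = Xc @ a / (a @ a)
    return a / np.linalg.norm(a)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
direction = linear_principal_curve(X)
# First right singular vector of the centered data, for comparison.
v1 = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]
```

Combining the two steps shows that each pass multiplies the current direction by the cross-product matrix of the centered data and rescales, which is exactly a power-method step; the fitted direction therefore agrees with the first principal component direction up to sign.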

Principal surfaces have exactly the same form as principal curves, but
are of higher dimension. The most commonly used is the two-dimensional
principal surface, with coordinate functions

    $f(\lambda_1, \lambda_2) = [f_1(\lambda_1, \lambda_2), \ldots, f_p(\lambda_1, \lambda_2)]$.

The estimates in step (a) above are obtained from two-dimensional surface
smoothers. Principal surfaces of dimension greater than two are rarely used,
since the visualization aspect is less attractive, as is smoothing in high
dimensions.

Figure 14.28 shows the result of a principal surface fit to the half-sphere
data. Plotted are the data points as a function of the estimated nonlinear
coordinates $\hat\lambda_1(x_i), \hat\lambda_2(x_i)$. The class separation is evident.

Principal surfaces are very similar to self-organizing maps. If we use a
kernel surface smoother to estimate each coordinate function $f_j(\lambda_1, \lambda_2)$,
this has the same form as the batch version of SOMs (14.48). The SOM
weights $w_k$ are just the weights in the kernel. There is a difference, however:
the principal surface estimates a separate prototype $f(\lambda_1(x_i), \lambda_2(x_i))$ for
each data point $x_i$, while the SOM shares a smaller number of prototypes
for all data points. As a result, the SOM and principal surface will agree
only as the number of SOM prototypes grows very large.

There also is a conceptual difference between the two. Principal surfaces
provide a smooth parameterization of the entire manifold in terms
of its coordinate functions, while SOMs are discrete and produce only the
estimated prototypes for approximating the data. The smooth parameterization
in principal surfaces preserves distance locally: in Figure 14.28 it
reveals that the red cluster is tighter than the green or blue clusters. In
simple examples the estimated coordinate functions themselves can be
informative: see Exercise 14.13.


14.5.3 Spectral Clustering

Traditional clustering methods like $K$-means use a spherical or elliptical
metric to group data points. Hence they will not work well when the clusters
are non-convex, such as the concentric circles in the top left panel of
Figure 14.29. Spectral clustering is a generalization of standard clustering
methods, and is designed for these situations. It has close connections with
the local multidimensional-scaling techniques (Section 14.9) that generalize
MDS.

The starting point is an $N \times N$ matrix of pairwise similarities $s_{ii'} \ge 0$
between all observation pairs. We represent the observations in an undirected
similarity graph $G = \langle V, E \rangle$. The $N$ vertices $v_i$ represent the observations,
and pairs of vertices are connected by an edge if their similarity is positive
(or exceeds some threshold). The edges are weighted by the $s_{ii'}$. Clustering
is now rephrased as a graph-partition problem, where we identify connected
components with clusters. We wish to partition the graph, such that edges
between different groups have low weight, and within a group have high
weight. The idea in spectral clustering is to construct similarity graphs that
represent the local neighborhood relationships between observations.

To make things more concrete, consider a set of $N$ points $x_i \in \mathbb{R}^p$, and let
$d_{ii'}$ be the Euclidean distance between $x_i$ and $x_{i'}$. We will use as similarity
matrix the radial-kernel gram matrix; that is, $s_{ii'} = \exp(-d_{ii'}^2/c)$, where
$c > 0$ is a scale parameter.

There are many ways to define a similarity matrix and its associated
similarity graph that reflect local behavior. The most popular is the mutual
$K$-nearest-neighbor graph. Define $\mathcal{N}_K$ to be the symmetric set of nearby
pairs of points; specifically a pair $(i, i')$ is in $\mathcal{N}_K$ if point $i$ is among the
$K$-nearest neighbors of $i'$, or vice-versa. Then we connect all symmetric
nearest neighbors, and give them edge weight $w_{ii'} = s_{ii'}$; otherwise the
edge weight is zero. Equivalently we set to zero all the pairwise similarities
not in $\mathcal{N}_K$, and draw the graph for this modified similarity matrix.

Alternatively, a fully connected graph includes all pairwise edges with
weights $w_{ii'} = s_{ii'}$, and the local behavior is controlled by the scale
parameter $c$.

The matrix of edge weights $\mathbf{W} = \{w_{ii'}\}$ from a similarity graph is called
the adjacency matrix. The degree of vertex $i$ is $g_i = \sum_{i'} w_{ii'}$, the sum of
the weights of the edges connected to it. Let $\mathbf{G}$ be a diagonal matrix with
diagonal elements $g_i$.

Finally, the graph Laplacian is defined by

    $\mathbf{L} = \mathbf{G} - \mathbf{W}$.    (14.63)

This is called the unnormalized graph Laplacian; a number of normalized
versions have been proposed. These standardize the Laplacian with respect
to the node degrees $g_i$, for example, $\tilde{\mathbf{L}} = \mathbf{I} - \mathbf{G}^{-1}\mathbf{W}$.

Spectral clustering finds the $m$ eigenvectors $\mathbf{Z}_{N \times m}$ corresponding to the
$m$ smallest eigenvalues of $\mathbf{L}$ (ignoring the trivial constant eigenvector).
Using a standard method like $K$-means, we then cluster the rows of $\mathbf{Z}$ to
yield a clustering of the original data points.
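
The construction of the graph Laplacian and its smallest eigenvectors can be sketched as follows. This is an illustration under assumptions: the function name `spectral_embedding`, its default parameters and the two-blob data are hypothetical, the final $K$-means step is omitted, and the returned eigenvectors include those with eigenvalue zero rather than dropping the trivial one.

```python
import numpy as np

def spectral_embedding(X, k=10, c=2.0, m=2):
    """Eigenvectors of L = G - W for a symmetric k-nearest-neighbor graph."""
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances
    S = np.exp(-d2 / c)                        # radial-kernel similarities
    # Row i of nn holds the k nearest neighbors of i (column 0 of the sort is i itself).
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), nn.ravel()] = True
    mask = mask | mask.T                       # i among the kNN of i', or vice versa
    W = np.where(mask, S, 0.0)                 # adjacency matrix
    L = np.diag(W.sum(axis=1)) - W             # unnormalized Laplacian (14.63)
    evals, evecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    return evals, evecs[:, :m]

# Two tight, well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(7)
A = 0.3 * rng.normal(size=(20, 2))
B = 0.3 * rng.normal(size=(20, 2)) + np.array([100.0, 0.0])
X = np.vstack([A, B])
evals, Z = spectral_embedding(X, k=10, c=2.0, m=2)
```

In this toy setup the nearest-neighbor graph splits into two connected components, so there are two zero eigenvalues and the rows of `Z` are constant within each blob; any standard clustering of those rows then recovers the two groups.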

An example is presented in Figure 14.29. The top left panel shows 450
simulated data points in three circular clusters indicated by the colors.
$K$-means clustering would clearly have difficulty identifying the outer clusters.
We applied spectral clustering using a 10-nearest neighbor similarity graph,
and display the eigenvectors corresponding to the second and third smallest
eigenvalues of the graph Laplacian in the lower left. The 15 smallest
eigenvalues are shown in the top right panel. The two eigenvectors shown have
identified the three clusters, and a scatterplot of the rows of the eigenvector
matrix $\mathbf{Z}$ in the bottom right clearly separates the clusters. A procedure
such as $K$-means clustering applied to these transformed points would
easily identify the three groups.

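Why does spectral clustering work? For any vector $\mathbf{f}$ we have

    $\mathbf{f}^T\mathbf{L}\mathbf{f} = \sum_{i=1}^N g_i f_i^2 - \sum_{i=1}^N\sum_{i'=1}^N f_i f_{i'} w_{ii'}$

    $\qquad\;\; = \frac{1}{2}\sum_{i=1}^N\sum_{i'=1}^N w_{ii'}(f_i - f_{i'})^2$.    (14.64)

Formula 14.64 suggests that a small value of $\mathbf{f}^T\mathbf{L}\mathbf{f}$ will be achieved if pairs
of points with large adjacencies have coordinates $f_i$ and $f_{i'}$ close together.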
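
The second equality in (14.64) follows in two short steps, using only the definition $g_i = \sum_{i'} w_{ii'}$ and the symmetry $w_{ii'} = w_{i'i}$ of the adjacency matrix. The derivation below is a sketch of that step, written out for convenience:

```latex
\begin{aligned}
\mathbf{f}^T\mathbf{L}\mathbf{f}
  &= \sum_{i} g_i f_i^2 - \sum_{i}\sum_{i'} w_{ii'} f_i f_{i'} \\
  &= \tfrac{1}{2}\Big(\sum_{i}\sum_{i'} w_{ii'} f_i^2
       - 2\sum_{i}\sum_{i'} w_{ii'} f_i f_{i'}
       + \sum_{i}\sum_{i'} w_{ii'} f_{i'}^2\Big) \\
  &= \tfrac{1}{2}\sum_{i}\sum_{i'} w_{ii'}\,(f_i - f_{i'})^2 \;\ge\; 0 .
\end{aligned}
```

The first and third double sums in the middle line are each equal to $\sum_i g_i f_i^2$, which is why the factor $\tfrac{1}{2}$ appears; the final inequality uses $w_{ii'} \ge 0$.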

Since $\mathbf{1}^T\mathbf{L}\mathbf{1} = 0$ for any graph, the constant vector is a trivial eigenvector
with eigenvalue zero. Not so obvious is the fact that if the graph is
connected⁵, it is the only zero eigenvector (Exercise 14.21). Generalizing this
argument, it is easy to show that for a graph with $m$ connected components,
the nodes can be reordered so that $\mathbf{L}$ is block diagonal with a block for each
connected component. Then $\mathbf{L}$ has $m$ eigenvectors of eigenvalue zero, and
the eigenspace of eigenvalue zero is spanned by the indicator vectors of the
connected components. In practice one has strong and weak connections,
so zero eigenvalues are approximated by small eigenvalues.

⁵ A graph is connected if any two nodes can be reached via a path of connected nodes.

FIGURE 14.29. Toy example illustrating spectral clustering. Data in top left are
450 points falling in three concentric clusters of 150 points each. The points are
uniformly distributed in angle, with radius 1, 2.8 and 5 in the three groups, and
Gaussian noise with standard deviation 0.25 added to each point. Using a $k = 10$
nearest-neighbor similarity graph, the eigenvectors corresponding to the second and
third smallest eigenvalues of $\mathbf{L}$ are shown in the bottom left; the smallest
eigenvector is constant. The data points are colored in the same way as in the top left.
The 15 smallest eigenvalues are shown in the top right panel. The coordinates of
the 2nd and 3rd eigenvectors (the 450 rows of $\mathbf{Z}$) are plotted in the bottom right
panel. Spectral clustering does standard (e.g., $K$-means) clustering of these points
and will easily recover the three original clusters.

Spectral clustering is an interesting approach for finding non-convex
clusters. When a normalized graph Laplacian is used, there is another way to
view this method. Defining $\mathbf{P} = \mathbf{G}^{-1}\mathbf{W}$, we consider a random walk on
the graph with transition probability matrix $\mathbf{P}$. Then spectral clustering
yields groups of nodes such that the random walk seldom transitions from
one group to another.

There are a number of issues that one must deal with in applying spectral
clustering in practice. We must choose the type of similarity graph (e.g.,
fully connected or nearest neighbors) and associated parameters such as the
number of nearest neighbors $k$ or the scale parameter of the kernel $c$. We
must also choose the number of eigenvectors to extract from $\mathbf{L}$ and finally,
as with all clustering methods, the number of clusters. In the toy example
of Figure 14.29 we obtained good results for $k \in [5, 200]$, the value 200
corresponding to a fully connected graph. With $k < 5$ the results deteriorated.
Looking at the top-right panel of Figure 14.29, we see no strong separation
between the smallest three eigenvalues and the rest. Hence it is not clear
how many eigenvectors to select.

14.5.4 Kernel Principal Components

Spectral clustering is related to kernel principal components, a non-linear
version of linear principal components. Standard linear principal components
(PCA) are obtained from the eigenvectors of the covariance matrix,
and give directions in which the data have maximal variance. Kernel PCA
(Schölkopf et al., 1999) expands the scope of PCA, mimicking what we would
obtain if we were to expand the features by non-linear transformations, and
then apply PCA in this transformed feature space.

We show in Section 18.5.2 that the principal components variables $\mathbf{Z}$ of
a data matrix $\mathbf{X}$ can be computed from the inner-product (gram) matrix
$\mathbf{K} = \mathbf{X}\mathbf{X}^T$. In detail, we compute the eigen-decomposition of the
double-centered version of the gram matrix

    $\tilde{\mathbf{K}} = (\mathbf{I} - \mathbf{M})\mathbf{K}(\mathbf{I} - \mathbf{M}) = \mathbf{U}\mathbf{D}^2\mathbf{U}^T$,    (14.65)

with $\mathbf{M} = \mathbf{1}\mathbf{1}^T/N$, and then $\mathbf{Z} = \mathbf{U}\mathbf{D}$. Exercise 18.15 shows how to compute
the projections of new observations in this space.

Kernel PCA simply mimics this procedure, interpreting the kernel matrix
$\mathbf{K} = \{K(x_i, x_{i'})\}$ as an inner-product matrix of the implicit features
$\langle\phi(x_i), \phi(x_{i'})\rangle$ and finding its eigenvectors. The elements of the $m$th
component $\mathbf{z}_m$ ($m$th column of $\mathbf{Z}$) can be written (up to centering) as
$z_{im} = \sum_{j=1}^N \alpha_{jm}K(x_i, x_j)$, where $\alpha_{jm} = u_{jm}/d_m$ (Exercise 14.16).

We can gain more insight into kernel PCA by viewing the $\mathbf{z}_m$ as sample
evaluations of principal component functions $g_m \in \mathcal{H}_K$, with $\mathcal{H}_K$ the
reproducing kernel Hilbert space generated by $K$ (see Section 5.8.1). The
first principal component function $g_1$ solves

    $\max_{g_1 \in \mathcal{H}_K} \mathrm{Var}_{\mathcal{T}}\, g_1(X)$ subject to $\|g_1\|_{\mathcal{H}_K} = 1$.    (14.66)

Here $\mathrm{Var}_{\mathcal{T}}$ refers to the sample variance over training data $\mathcal{T}$. The norm
constraint $\|g_1\|_{\mathcal{H}_K} = 1$ controls the size and roughness of the function $g_1$,
as dictated by the kernel $K$. As in the regression case it can be shown that
the solution to (14.66) is finite dimensional with representation
$g_1(x) = \sum_{j=1}^N c_jK(x, x_j)$. Exercise 14.17 shows that the solution is defined by
$\hat{c}_j = \alpha_{j1}$, $j = 1, \ldots, N$ above. The second principal component function is
defined in a similar way, with the additional constraint that $\langle g_1, g_2\rangle_{\mathcal{H}_K} = 0$,
and so on.⁶

Schölkopf et al. (1999) demonstrate the use of kernel principal components
as features for handwritten-digit classification, and show that they
can improve the performance of a classifier when these are used instead of
linear principal components.

Note that if we use the radial kernel

    $K(x, x') = \exp(-\|x - x'\|^2/c)$,    (14.67)

then the kernel matrix $\mathbf{K}$ has the same form as the similarity matrix $\mathbf{S}$ in
spectral clustering. The matrix of edge weights $\mathbf{W}$ is a localized version of
$\mathbf{K}$, setting to zero all similarities for pairs of points that are not nearest
neighbors.
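
The kernel PCA computation via the double-centered kernel matrix can be sketched as follows. The function names `kernel_pca` and `radial_kernel` and the synthetic data are hypothetical; the sketch assumes the leading eigenvalues are strictly positive so that the division by $d_m$ is well defined.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Principal component variables Z = U D from a gram/kernel matrix K."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N        # I - M with M = 1 1^T / N
    Kc = C @ K @ C                             # double-centered kernel (14.65)
    evals, U = np.linalg.eigh(Kc)              # ascending; eigenvalues are d^2
    order = np.argsort(evals)[::-1][:n_components]
    d = np.sqrt(np.clip(evals[order], 0.0, None))
    U = U[:, order]
    Z = U * d                                  # Z = U D
    alpha = U / d                              # alpha_jm = u_jm / d_m
    return Z, alpha, Kc

def radial_kernel(X, c):
    """Radial kernel matrix (14.67) for the rows of X."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / c)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
Z_rbf, alpha, Kc = kernel_pca(radial_kernel(X, c=2.0))
Z_lin, _, _ = kernel_pca(X @ X.T)              # linear kernel: ordinary PCA scores

# Ordinary principal components of X, for comparison with the linear kernel.
Xc = X - X.mean(axis=0)
U_s, d_s, _ = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U_s[:, :2] * d_s[:2]
```

With the linear kernel the result agrees with ordinary PCA up to the sign of each column, and multiplying the centered kernel matrix by `alpha` reproduces `Z`, which is the representation of the components as kernel expansions (up to centering).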

Kernel PCA finds the eigenvectors corresponding to the largest eigenvalues
of $\tilde{\mathbf{K}}$; this is equivalent to finding the eigenvectors corresponding to the
smallest eigenvalues of

    $\mathbf{I} - \tilde{\mathbf{K}}$.    (14.68)

This is almost the same as the Laplacian (14.63), the differences being the
centering of $\tilde{\mathbf{K}}$ and the fact that $\mathbf{G}$ has the degrees of the nodes along the
diagonal.

Figure 14.30 examines the performance of kernel principal components
in the toy example of Figure 14.29. In the upper left panel we used the radial
kernel with $c = 2$, the same value that was used in spectral clustering.
This does not separate the groups, but with $c = 10$ (upper right panel), the
first component separates the groups well. In the lower-left panel we applied
kernel PCA using the nearest-neighbor radial kernel $\mathbf{W}$ from spectral
clustering. In the lower right panel we use the kernel matrix itself as the
similarity matrix for constructing the Laplacian (14.63) in spectral clustering.
In neither case do the projections separate the two groups. Adjusting
$c$ did not help either.

⁶ This section benefited from helpful discussions with Jonathan Taylor.

FIGURE 14.30. Kernel principal components applied to the toy example of
Figure 14.29, using different kernels. (Top left:) Radial kernel (14.67) with $c = 2$.
(Top right:) Radial kernel with $c = 10$. (Bottom left:) Nearest neighbor radial kernel
$\mathbf{W}$ from spectral clustering. (Bottom right:) Spectral clustering with Laplacian
constructed from the radial kernel.

In this toy example, we see that kernel PCA is quite sensitive to the scale 
and nature of the kernel. We also see that the nearest-neighbor truncation 
of the kernel is important for the success of spectral clustering. 

14.5.5 Sparse Principal Components

We often interpret principal components by examining the direction vectors
$v_j$, also known as loadings, to see which variables play a role. We did this
with the image loadings in (14.55). Often this interpretation is made easier
if the loadings are sparse. In this section we briefly discuss some methods
for deriving principal components with sparse loadings. They are all based
on lasso ($L_1$) penalties.

We start with an $N \times p$ data matrix $\mathbf{X}$, with centered columns. The
proposed methods focus on either the maximum-variance property of principal
components, or the minimum reconstruction error. The SCoTLASS
procedure of Joliffe et al. (2003) takes the first approach, by solving

    $\max v^T(\mathbf{X}^T\mathbf{X})v$, subject to $\sum_{j=1}^p |v_j| \le t$, $v^Tv = 1$.    (14.69)

The absolute-value constraint encourages some of the loadings to be zero
and hence $v$ to be sparse. Further sparse principal components are found
in the same way, by forcing the $k$th component to be orthogonal to the
first $k - 1$ components. Unfortunately this problem is not convex and the
computations are difficult.

Zou et al. (2006) start instead with the regression/reconstruction property
of PCA, similar to the approach in Section 14.5.1. Let $x_i$ be the $i$th row
of $\mathbf{X}$. For a single component, their sparse principal component technique
solves

    $\min_{\theta, v} \sum_{i=1}^N \|x_i - \theta v^Tx_i\|_2^2 + \lambda\|v\|_2^2 + \lambda_1\|v\|_1$    (14.70)

    subject to $\|\theta\|_2 = 1$.

Let’s examine this formulation in more detail.

• If both $\lambda$ and $\lambda_1$ are zero and $N > p$, it is easy to show that $v = \theta$
and is the largest principal component direction.

• When $p > N$ the solution is not necessarily unique unless $\lambda > 0$. For
any $\lambda > 0$ and $\lambda_1 = 0$ the solution for $v$ is proportional to the largest
principal component direction.

• The second penalty on $v$ encourages sparseness of the loadings.


[Figure 14.31 panel titles: Walking Speed; Verbal Fluency]

FIGURE 14.31. Standard and sparse principal components from a study of 
the corpus callosum variation. The shape variations corresponding to significant 
principal components (red curves) are overlaid on the mean CC shape (black 
curves). 

For multiple components, the sparse principal components procedure
minimizes

    $\sum_{i=1}^N \|x_i - \mathbf{\Theta}\mathbf{V}^Tx_i\|^2 + \lambda\sum_{k=1}^K \|v_k\|_2^2 + \sum_{k=1}^K \lambda_{1k}\|v_k\|_1$,    (14.71)

subject to $\mathbf{\Theta}^T\mathbf{\Theta} = \mathbf{I}_K$. Here $\mathbf{V}$ is a $p \times K$ matrix with columns $v_k$ and $\mathbf{\Theta}$
is also $p \times K$.

Criterion (14.71) is not jointly convex in $\mathbf{V}$ and $\mathbf{\Theta}$, but it is convex in
each parameter with the other parameter fixed⁷. Minimization over $\mathbf{V}$ with
$\mathbf{\Theta}$ fixed is equivalent to $K$ elastic net problems (Section 18.4) and can be
done efficiently. On the other hand, minimization over $\mathbf{\Theta}$ with $\mathbf{V}$ fixed is a
version of the Procrustes problem (14.56), and is solved by a simple SVD
calculation (Exercise 14.12). These steps are alternated until convergence.
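
For a single component, the alternation between the elastic-net step and the direction step can be sketched as below. This is a simplified illustration of criterion (14.70), not the authors' implementation: the function names, the plain coordinate-descent solver, the iteration counts and the choice to start from the ordinary first principal direction are all assumptions.

```python
import numpy as np

def soft_threshold(b, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(b) * max(abs(b) - t, 0.0)

def sparse_pc(X, lam, lam1, n_outer=50, n_inner=20):
    """One sparse principal component by alternating over v and theta (14.70)."""
    p = X.shape[1]
    theta = np.linalg.svd(X, full_matrices=False)[2][0]   # start at the ordinary PC
    v = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _outer in range(n_outer):
        # v-step: elastic net regression of y = X theta on X, by coordinate descent.
        y = X @ theta
        for _inner in range(n_inner):
            for j in range(p):
                r_j = y - X @ v + X[:, j] * v[j]           # partial residual
                v[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2) / (col_ss[j] + lam)
        # theta-step: the unit vector maximizing theta^T X^T X v.
        g = X.T @ (X @ v)
        if np.linalg.norm(g) == 0:
            break                                          # all loadings shrunk to zero
        theta = g / np.linalg.norm(g)
    return v, theta

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)                                        # centered columns
v_ridge, _ = sparse_pc(X, lam=1.0, lam1=0.0)               # no lasso penalty
v_sparse, _ = sparse_pc(X, lam=1.0, lam1=4000.0)           # heavy lasso penalty
v1 = np.linalg.svd(X, full_matrices=False)[2][0]
```

With the lasso penalty switched off, the loading vector is proportional to the ordinary first principal direction, as stated in the bullets above; with a large penalty only the dominant variable keeps a nonzero loading in this synthetic example.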

Figure 14.31 shows an example of sparse principal components analysis
using (14.71), taken from Sjöstrand et al. (2007). Here the shape of the
mid-sagittal cross-section of the corpus callosum (CC) is related to various
clinical parameters in a study involving 569 elderly persons⁸. In this example
PCA is applied to shape data, and is a popular tool in morphometrics.
For such applications, a number of landmarks are identified along the
circumference of the shape; an example is given in Figure 14.32. These are
aligned by Procrustes analysis to allow for rotations, and in this case scaling
as well (see Section 14.5.1). The features used for PCA are the sequence
of coordinate pairs for each landmark, unpacked into a single vector.

⁷ Note that the usual principal component criterion, for example (14.50), is not jointly
convex in the parameters either. Nevertheless, the solution is well defined and an efficient
algorithm is available.

⁸ We thank Rasmus Larsen and Karl Sjöstrand for suggesting this application, and
supplying us with the postscript figures reproduced here.

FIGURE 14.32. An example of a mid-sagittal brain slice, with the corpus
callosum annotated with landmarks.

In this analysis, both standard and sparse principal components were 
computed, and components that were significantly associated with various 
clinical parameters were identified. In the figure, the shape variations cor¬ 
responding to significant principal components (red curves) are overlaid on 
the mean CC shape (black curves). Low walking speed relates to CCs that 
are thinner (displaying atrophy) in regions connecting the motor control 
and cognitive centers of the brain. Low verbal fluency relates to CCs that 
are thinner in regions connecting auditory/visual/cognitive centers. The 
sparse principal components procedure gives a more parsimonious, and
potentially more informative picture of the important differences.

14.6 Non-negative Matrix Factorization

Non-negative matrix factorization (Lee and Seung, 1999) is a recent
alternative approach to principal components analysis, in which the data
and components are assumed to be non-negative. It is useful for modeling 
non-negative data such as images. 

The N × p data matrix X is approximated by

$$ X \approx WH \tag{14.72} $$

where W is N × r and H is r × p, with $r \leq \max(N, p)$. We assume that
$x_{ij}, w_{ik}, h_{kj} \geq 0$.

The matrices W and H are found by maximizing

$$ L(W, H) = \sum_{i=1}^{N} \sum_{j=1}^{p} \left[ x_{ij} \log (WH)_{ij} - (WH)_{ij} \right]. \tag{14.73} $$

This is the log-likelihood from a model in which $x_{ij}$ has a Poisson
distribution with mean $(WH)_{ij}$, which is quite reasonable for positive data.

The following alternating algorithm (Lee and Seung, 2001) converges to
a local maximum of L(W, H):

$$
\begin{aligned}
w_{ik} &\leftarrow w_{ik}\,\frac{\sum_{j=1}^{p} h_{kj}\,x_{ij}/(WH)_{ij}}{\sum_{j=1}^{p} h_{kj}}, \\
h_{kj} &\leftarrow h_{kj}\,\frac{\sum_{i=1}^{N} w_{ik}\,x_{ij}/(WH)_{ij}}{\sum_{i=1}^{N} w_{ik}}.
\end{aligned}
\tag{14.74}
$$

This algorithm can be derived as a minorization procedure for maximizing
L(W, H) (Exercise 14.23) and is also related to the iterative-proportional-
scaling algorithm for log-linear models (Exercise 14.24).

Figure 14.33 shows an example taken from Lee and Seung (1999) 9 , comparing
non-negative matrix factorization (NMF), vector quantization (VQ,
equivalent to K-means clustering) and principal components analysis (PCA).
The three learning methods were applied to a database of N = 2,429 facial
images, each consisting of 19 × 19 pixels, resulting in a 2,429 × 361
matrix X. As shown in the 7 × 7 array of montages (each a 19 × 19 image),
each method has learned a set of r = 49 basis images. Positive values are
illustrated with black pixels and negative values with red pixels. A
particular instance of a face, shown at top right, is approximated by a linear
superposition of basis images. The coefficients of the linear superposition
are shown next to each montage, in a 7 × 7 array 10 , and the resulting
superpositions are shown to the right of the equality sign. The authors point
out that unlike VQ and PCA, NMF learns to represent faces with a set of
basis images resembling parts of faces.

9 We thank Sebastian Seung for providing this image.

10 These 7 × 7 arrangements allow for a compact display, and have no structural
significance.
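
The multiplicative updates (14.74) are short enough to sketch directly. The following is a minimal NumPy illustration, not the authors' implementation: the function names, the random positive starting values, the small constant `eps` that guards against division by zero, and the synthetic rank-2 test matrix are all assumptions made for the example.

```python
import numpy as np

def poisson_loglik(X, W, H, eps=1e-12):
    """Criterion (14.73): sum_ij [x_ij log (WH)_ij - (WH)_ij]."""
    WH = W @ H
    return float(np.sum(X * np.log(WH + eps) - WH))

def nmf_multiplicative(X, r, n_iter=200, seed=0, eps=1e-12):
    """Alternate the two multiplicative updates in (14.74)."""
    rng = np.random.default_rng(seed)       # random positive start (assumption)
    N, p = X.shape
    W = rng.uniform(0.1, 1.0, size=(N, r))
    H = rng.uniform(0.1, 1.0, size=(r, p))
    for _ in range(n_iter):
        ratio = X / (W @ H + eps)                       # x_ij / (WH)_ij
        W = W * (ratio @ H.T) / (H.sum(axis=1) + eps)   # update w_ik
        ratio = X / (W @ H + eps)                       # recompute with new W
        H = H * (W.T @ ratio) / (W.sum(axis=0)[:, None] + eps)  # update h_kj
    return W, H

# Synthetic example: an exactly rank-2 non-negative 20 x 8 matrix.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(20, 2)) @ rng.uniform(0.0, 1.0, size=(2, 8))

W_few, H_few = nmf_multiplicative(X, r=2, n_iter=5)
W_many, H_many = nmf_multiplicative(X, r=2, n_iter=200)
print("criterion after 5 iterations:  ", poisson_loglik(X, W_few, H_few))
print("criterion after 200 iterations:", poisson_loglik(X, W_many, H_many))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative without any explicit projection step, and the criterion should not decrease from one iteration to the next.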

Donoho and Stodden (2004) point out a potentially serious problem with 
non-negative matrix factorization. Even in situations where X = WH holds 
exactly, the decomposition may not be unique. Figure 14.34 illustrates the 
problem. The data points lie in p = 2 dimensions, and there is “open space” 
between the data and the coordinate axes. We can choose the basis vectors 
$h_1$ and $h_2$ anywhere in this open space, and represent each data point
exactly with a nonnegative linear combination of these vectors. This
non-uniqueness means that the solution found by the above algorithm depends
on the starting values, and it would seem to hamper the interpretability of
the factorization. Despite this interpretational drawback, non-negative
matrix factorization and its applications have attracted a lot of interest.

14.6.1 Archetypal Analysis

This method, due to Cutler and Breiman (1994), approximates data points
by prototypes that are themselves linear combinations of data points. In
this sense it has a similar flavor to K-means clustering. However, rather
than approximating each data point by a single nearby prototype, archetypal
analysis approximates each data point by a convex combination of a
collection of prototypes. The use of a convex combination forces the
prototypes to lie on the convex hull of the data cloud. In this sense, the
prototypes are “pure,” or “archetypal.”

As in (14.72), the N × p data matrix X is modeled as

$$ X \approx WH \tag{14.75} $$

where W is N × r and H is r × p. We assume that $w_{ik} \geq 0$ and
$\sum_{k=1}^{r} w_{ik} = 1 \;\forall i$. Hence the N data points (rows of X) in
p-dimensional space are represented by convex combinations of the r
archetypes (rows of H). We also assume that

$$ H = BX \tag{14.76} $$

where B is r × N with $b_{ki} \geq 0$ and $\sum_{i=1}^{N} b_{ki} = 1 \;\forall k$.
Thus the archetypes themselves are convex combinations of the data points.
Using both (14.75) and (14.76) we minimize

$$
\begin{aligned}
J(W, B) &= \|X - WH\|^2 \\
        &= \|X - WBX\|^2
\end{aligned}
\tag{14.77}
$$

over the weights W and B. This function is minimized in an alternating
fashion, with each separate minimization involving a convex optimization.
The overall problem is not convex, however, and so the algorithm converges
to a local minimum of the criterion.

FIGURE 14.33. Non-negative matrix factorization (NMF), vector quantization
(VQ, equivalent to k-means clustering) and principal components analysis (PCA)
applied to a database of facial images. Details are given in the text. Unlike VQ
and PCA, NMF learns to represent faces with a set of basis images resembling
parts of faces.

FIGURE 14.34. Non-uniqueness of the non-negative matrix factorization.
There are 11 data points in two dimensions. Any choice of the basis vectors $h_1$
and $h_2$ in the open space between the coordinate axes and the data gives an exact
reconstruction of the data.


Figure 14.35 shows an example with simulated data in two dimensions. 
The top panel displays the results of archetypal analysis, while the bottom 
panel shows the results from K-means clustering. In order to best
reconstruct the data from convex combinations of the prototypes, it pays to
locate the prototypes on the convex hull of the data. This is seen in the top
panels of Figure 14.35 and is the case in general, as proven by Cutler and
Breiman (1994). K-means clustering, shown in the bottom panels, chooses
prototypes in the middle of the data cloud. 

We can think of K-means clustering as a special case of the archetypal
model, in which each row of W has a single one and the rest of the entries 
are zero. 
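
A tiny numerical check may make this special case concrete. In the sketch below, the five data points and the cluster labels are made up for illustration: W has one-hot rows, each row of B averages the members of one cluster, so H = BX holds the cluster centroids and the criterion (14.77) reduces to the within-cluster sum of squares.

```python
import numpy as np

# Five made-up points in the plane and a hypothetical 2-cluster assignment.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0], [4.0, 4.0]])
labels = np.array([0, 0, 1, 1, 1])
N, r = X.shape[0], 2

# W: each row has a single one (the cluster membership), the rest zeros.
W = np.zeros((N, r))
W[np.arange(N), labels] = 1.0

# B: row k puts equal weight on the members of cluster k, so rows sum to one.
B = W.T / W.sum(axis=0)[:, None]

H = B @ X                          # prototypes = cluster centroids, as in (14.76)
J = float(np.sum((X - W @ H) ** 2))  # criterion (14.77)

print(H)   # centroids (0, 1) and (4, 2)
print(J)   # within-cluster sum of squares, 2 + 8 = 10
```

Both W and B satisfy the non-negativity and sum-to-one constraints of the archetypal model, which is why the K-means solution is a feasible (though generally not optimal) point for (14.77).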

Notice also that the archetypal model (14.75) has the same general form 
as the non-negative matrix factorization model (14.72). However, the two 
models are applied in different settings, and have somewhat different goals. 
Non-negative matrix factorization aims to approximate the columns of the 
data matrix X, and the main output of interest are the columns of W 
representing the primary non-negative components in the data. Archetypal 
analysis focuses instead on the approximation of the rows of X using the 
rows of H, which represent the archetypal data points. Non-negative matrix 
factorization also assumes that r ≤ p. With r = p, we can get an exact
reconstruction simply by choosing W to be the data X with columns scaled
so that they sum to 1. In contrast, archetypal analysis requires r ≤ N,
but allows r > p. In Figure 14.35, for example, p = 2, N = 50 while
r = 2, 4 or 8. The additional constraint (14.76) implies that the archetypal
approximation will not be perfect, even if r > p.

Figure 14.36 shows the results of archetypal analysis applied to the
database of 3’s displayed in Figure 14.22. The three rows in Figure 14.36
are the resulting archetypes from three runs, specifying two, three and four
archetypes, respectively. As expected, the algorithm has produced extreme
3’s both in size and shape.

FIGURE 14.35. Archetypal analysis (top panels) and K-means clustering (bottom
panels) applied to 50 data points drawn from a bivariate Gaussian distribution.
The colored points show the positions of the prototypes in each case.


14.7 Independent Component Analysis and 
Exploratory Projection Pursuit 

Multivariate data are often viewed as multiple indirect measurements
arising from an underlying source, which typically cannot be directly measured.
Examples include the following: 

• Educational and psychological tests use the answers to questionnaires 
to measure the underlying intelligence and other mental abilities of 
subjects. 

• EEG brain scans measure the neuronal activity in various parts of 
the brain indirectly via electromagnetic signals recorded at sensors 
placed at various positions on the head. 

• The trading prices of stocks change constantly over time, and reflect
various unmeasured factors such as market confidence, external
influences, and other driving forces that may be hard to identify or
measure.

FIGURE 14.36. Archetypal analysis applied to the database of digitized 3’s. The
rows in the figure show the resulting archetypes from three runs, specifying two,
three and four archetypes, respectively.

Factor analysis is a classical technique developed in the statistical
literature that aims to identify these latent sources. Factor analysis models
are typically wed to Gaussian distributions, which has to some extent
hindered their usefulness. More recently, independent component analysis has
emerged as a strong competitor to factor analysis, and as we will see, relies 
on the non-Gaussian nature of the underlying sources for its success. 


14.7.1 Latent Variables and Factor Analysis

The singular-value decomposition $X = UDV^T$ (14.54) has a latent variable
representation. Writing $S = \sqrt{N}\,U$ and $A^T = DV^T/\sqrt{N}$, we have
$X = SA^T$, and hence each of the columns of X is a linear combination of the
columns of S. Now since U is orthogonal, and assuming as before that the
columns of X (and hence U) each have mean zero, this implies that the
columns of S have zero mean, are uncorrelated and have unit variance. In
terms of random variables, we can interpret the SVD, or the corresponding
principal component analysis (PCA) as an estimate of a latent variable
model

$$
\begin{aligned}
X_1 &= a_{11}S_1 + a_{12}S_2 + \cdots + a_{1p}S_p \\
X_2 &= a_{21}S_1 + a_{22}S_2 + \cdots + a_{2p}S_p \\
    &\;\;\vdots \\
X_p &= a_{p1}S_1 + a_{p2}S_2 + \cdots + a_{pp}S_p
\end{aligned}
\tag{14.78}
$$


or simply X = AS. The correlated $X_j$ are each represented as a linear
expansion in the uncorrelated, unit variance variables $S_\ell$. This is not too
satisfactory, though, because given any orthogonal p × p matrix R, we can
write

$$
\begin{aligned}
X &= AS \\
  &= AR^T RS \\
  &= A^* S^*,
\end{aligned}
\tag{14.79}
$$

and $\mathrm{Cov}(S^*) = R\,\mathrm{Cov}(S)\,R^T = I$. Hence there are many such
decompositions, and it is therefore impossible to identify any particular latent
variables as unique underlying sources. The SVD decomposition does have
the property that any rank q < p truncated decomposition approximates
X in an optimal way.
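
The rotation argument in (14.79) can be checked numerically. The sketch below uses simulated data and a random orthogonal matrix R (both are assumptions for illustration); it works in the data-matrix convention $X = SA^T$, where the rows of S are the latent vectors, so rotating each latent vector by R corresponds to $S^* = SR^T$ and $(A^*)^T = RA^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3

# Simulated correlated data with centered columns.
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X -= X.mean(axis=0)

# Latent-variable form of the SVD: S = sqrt(N) U, A^T = D V^T / sqrt(N).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
S = np.sqrt(N) * U
At = np.diag(d) @ Vt / np.sqrt(N)

# A random orthogonal matrix R from a QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(p, p)))
S_star = S @ R.T       # rows are (R s_i)^T
At_star = R @ At       # (A R^T)^T = R A^T

print(np.allclose(S @ At, X), np.allclose(S_star @ At_star, X))
print(np.round(S.T @ S / N, 6))            # identity
print(np.round(S_star.T @ S_star / N, 6))  # identity again
```

Both factorizations reproduce X exactly and both sets of latent variables have identity sample covariance, so second moments alone cannot distinguish S from its rotation.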

The classical factor analysis model, developed primarily by researchers in 
psychometrics, alleviates these problems to some extent; see, for example, 
Mardia et al. (1979). With q < p, a factor analysis model has the form 


$$
\begin{aligned}
X_1 &= a_{11}S_1 + \cdots + a_{1q}S_q + \varepsilon_1 \\
X_2 &= a_{21}S_1 + \cdots + a_{2q}S_q + \varepsilon_2 \\
    &\;\;\vdots \\
X_p &= a_{p1}S_1 + \cdots + a_{pq}S_q + \varepsilon_p,
\end{aligned}
\tag{14.80}
$$

or $X = AS + \varepsilon$. Here S is a vector of q < p underlying latent variables or
factors, A is a p × q matrix of factor loadings, and the $\varepsilon_j$ are uncorrelated
zero-mean disturbances. The idea is that the latent variables $S_\ell$ are
common sources of variation amongst the $X_j$, and account for their correlation
structure, while the uncorrelated $\varepsilon_j$ are unique to each $X_j$ and pick up the
remaining unaccounted variation. Typically the $S_\ell$ and the $\varepsilon_j$ are modeled
as Gaussian random variables, and the model is fit by maximum likelihood.
The parameters all reside in the covariance matrix

$$ \Sigma = AA^T + D_\varepsilon, \tag{14.81} $$

where $D_\varepsilon = \mathrm{diag}[\mathrm{Var}(\varepsilon_1), \ldots, \mathrm{Var}(\varepsilon_p)]$. The $S_\ell$ being Gaussian and
uncorrelated makes them statistically independent random variables. Thus a
battery of educational test scores would be thought to be driven by the
independent underlying factors such as intelligence, drive and so on. The
columns of A are referred to as the factor loadings, and are used to name
and interpret the factors.

Unfortunately the identifiability issue (14.79) remains, since A and $AR^T$
are equivalent in (14.81) for any q × q orthogonal R. This leaves a certain
subjectivity in the use of factor analysis, since the user can search for
rotated versions of the factors that are more easily interpretable. This aspect
has left many analysts skeptical of factor analysis, and may account for its
lack of popularity in contemporary statistics. Although we will not go into
details here, the SVD plays a key role in the estimation of (14.81). For
example, if the $\mathrm{Var}(\varepsilon_j)$ are all assumed to be equal, the leading q components
of the SVD identify the subspace determined by A.

Because of the separate disturbances $\varepsilon_j$ for each $X_j$, factor analysis can
be seen to be modeling the correlation structure of the $X_j$ rather than the
covariance structure. This can be easily seen by standardizing the
covariance structure in (14.81) (Exercise 14.14). This is an important distinction
between factor analysis and PCA, although not central to the discussion 
here. Exercise 14.15 discusses a simple example where the solutions from 
factor analysis and PCA differ dramatically because of this distinction. 

14.7.2 Independent Component Analysis

The independent component analysis (ICA) model has exactly the same
form as (14.78), except the $S_\ell$ are assumed to be statistically
independent rather than uncorrelated. Intuitively, lack of correlation determines
the second-degree cross-moments (covariances) of a multivariate
distribution, while in general statistical independence determines all of the
cross-moments. These extra moment conditions allow us to identify the elements
of A uniquely. Since the multivariate Gaussian distribution is determined
by its second moments alone, it is the exception, and any Gaussian
independent components can be determined only up to a rotation, as before.
Hence identifiability problems in (14.78) and (14.80) can be avoided if we
assume that the $S_\ell$ are independent and non-Gaussian.

Here we will discuss the full p-component model as in (14.78), where the
$S_\ell$ are independent with unit variance; ICA versions of the factor analysis
model (14.80) exist as well. Our treatment is based on the survey article
by Hyvärinen and Oja (2000).

We wish to recover the mixing matrix A in X = AS. Without loss
of generality, we can assume that X has already been whitened to have
Cov(X) = I; this is typically achieved via the SVD described above. This
in turn implies that A is orthogonal, since S also has covariance I. So
solving the ICA problem amounts to finding an orthogonal A such that
the components of the vector random variable $S = A^T X$ are independent
(and non-Gaussian).
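
The claim that whitening leaves only an orthogonal mixing matrix can be illustrated on simulated data. In this sketch the uniform sources and the mixing matrix `A0` are hypothetical choices; the matrix M is the mixing that remains after whitening, and it is only approximately orthogonal here because the sample covariance of the simulated sources is close to, but not exactly, the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5000, 2

# Independent non-Gaussian sources with mean 0 and variance 1:
# uniform on [-sqrt(3), sqrt(3)].
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N, p))
S -= S.mean(axis=0)

A0 = np.array([[1.0, 2.0], [0.5, -1.0]])   # hypothetical mixing matrix
X = S @ A0.T                                # rows are x_i = A0 s_i

# Whiten with the SVD: X_white = sqrt(N) U has sample covariance exactly I.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_white = np.sqrt(N) * U

# X_white = X K, so each whitened observation is K^T A0 s_i = M s_i.
K = Vt.T @ np.diag(np.sqrt(N) / d)
M = K.T @ A0

print(np.round(X_white.T @ X_white / N, 6))  # identity
print(np.round(M @ M.T, 2))                  # close to the identity
```

After this step the ICA search is restricted to rotations of the whitened data, which is the problem described in the paragraph above.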

Figure 14.37 shows the power of ICA in separating two mixed signals.
This is an example of the classical cocktail party problem, where
different microphones $X_j$ pick up mixtures of different independent sources $S_\ell$
(music, speech from different speakers, etc.). ICA is able to perform blind
source separation, by exploiting the independence and non-Gaussianity of
the original sources.

[Figure 14.37 panels: Source Signals, Measured Signals, PCA Solution, ICA Solution]

FIGURE 14.37. Illustration of ICA vs. PCA on artificial time-series data. The
upper left panel shows the two source signals, measured at 1000 uniformly spaced
time points. The upper right panel shows the observed mixed signals. The lower
two panels show the principal components and independent component solutions.

Many of the popular approaches to ICA are based on entropy. The
differential entropy H of a random variable Y with density g(y) is given by

$$ H(Y) = -\int g(y) \log g(y)\,dy. \tag{14.82} $$

A well-known result in information theory says that among all random
variables with equal variance, Gaussian variables have the maximum
entropy. Finally, the mutual information I(Y) between the components of the
random vector Y is a natural measure of dependence:

$$ I(Y) = \sum_{j=1}^{p} H(Y_j) - H(Y). \tag{14.83} $$

The quantity I(Y) is called the Kullback-Leibler distance between the
density g(y) of Y and its independence version $\prod_{j=1}^{p} g_j(y_j)$, where $g_j(y_j)$
is the marginal density of $Y_j$. Now if X has covariance I, and $Y = A^T X$
with A orthogonal, then it is easy to show that

$$ I(Y) = \sum_{j=1}^{p} H(Y_j) - H(X) - \log|\det A| \tag{14.84} $$

$$ \phantom{I(Y)} = \sum_{j=1}^{p} H(Y_j) - H(X). \tag{14.85} $$

Finding an A to minimize $I(Y) = I(A^T X)$ looks for the orthogonal
transformation that leads to the most independence between its components. In
light of (14.84) this is equivalent to minimizing the sum of the entropies of
the separate components of Y, which in turn amounts to maximizing their
departures from Gaussianity.

FIGURE 14.38. Mixtures of independent uniform random variables. The upper
left panel shows 500 realizations from the two independent uniform sources, the
upper right panel their mixed versions. The lower two panels show the PCA and
ICA solutions, respectively.
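
The maximum-entropy property of the Gaussian quoted above can be seen from closed-form differential entropies. The sketch below compares three unit-variance distributions; the uniform and Laplace cases are chosen here purely as illustrations, using the standard formulas log(width) for a uniform and 1 + log(2b) for a Laplace with scale b.

```python
import math

# Differential entropies (14.82) of three distributions with variance 1.
h_gauss = 0.5 * math.log(2.0 * math.pi * math.e)   # N(0, 1)
h_unif = math.log(math.sqrt(12.0))                 # uniform of width sqrt(12)
h_laplace = 1.0 + math.log(math.sqrt(2.0))         # Laplace with scale 1/sqrt(2)

for name, h in [("Gaussian", h_gauss), ("Laplace", h_laplace), ("uniform", h_unif)]:
    # The gap to the Gaussian entropy is zero only for the Gaussian itself.
    print(f"{name:9s} entropy = {h:.4f}   gap to Gaussian = {h_gauss - h:.4f}")
```

The gap to the Gaussian entropy is positive for both non-Gaussian cases, which is what makes entropy usable as a measure of departure from Gaussianity.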

For convenience, rather than using the entropy $H(Y_j)$, Hyvärinen and
Oja (2000) use the negentropy measure $J(Y_j)$ defined by

$$ J(Y_j) = H(Z_j) - H(Y_j), \tag{14.86} $$

where $Z_j$ is a Gaussian random variable with the same variance as $Y_j$.
Negentropy is non-negative, and measures the departure of $Y_j$ from
Gaussianity. They propose simple approximations to negentropy which can be
computed and optimized on data. The ICA solutions shown in Figures 14.37 to
14.39 use the approximation

$$ J(Y_j) \approx [\mathrm{E}\,G(Y_j) - \mathrm{E}\,G(Z_j)]^2, \tag{14.87} $$

where $G(u) = \frac{1}{a}\log\cosh(au)$ for $1 \leq a \leq 2$. When applied to a sample
of $X_j$, the expectations are replaced by data averages. This is one of the
options in the FastICA software provided by these authors. More classical
(and less robust) measures are based on fourth moments, and hence look for
departures from the Gaussian via kurtosis. See Hyvärinen and Oja (2000)
for more details. In Section 14.7.4 we describe their approximate Newton
algorithm for finding the optimal directions.
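
As a rough illustration of (14.87), the sketch below evaluates the log-cosh approximation on simulated samples. It is not the FastICA code: the function names are hypothetical, the sample is standardized so that the reference Gaussian is N(0, 1), and the Gaussian expectation is computed by a simple Riemann sum on a grid, which is an implementation choice made here.

```python
import numpy as np

def G(u, a=1.0):
    """Contrast function G(u) = (1/a) log cosh(a u)."""
    return np.log(np.cosh(a * u)) / a

def gaussian_mean_G(a=1.0):
    """E G(Z) for Z ~ N(0, 1), by a Riemann sum on a fine grid."""
    z = np.linspace(-10.0, 10.0, 20001)
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(G(z, a) * phi) * (z[1] - z[0]))

def negentropy_approx(y, a=1.0):
    """Sample version of (14.87) after standardizing y to unit variance."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()
    return float((G(y, a).mean() - gaussian_mean_G(a)) ** 2)

rng = np.random.default_rng(0)
J_gauss = negentropy_approx(rng.normal(size=20000))
J_unif = negentropy_approx(rng.uniform(-1.0, 1.0, size=20000))
print("Gaussian sample:", J_gauss)
print("uniform sample: ", J_unif)
```

The Gaussian sample gives a value near zero, while the uniform sample gives a clearly larger one, in line with negentropy measuring departure from Gaussianity.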

In summary then, ICA applied to multivariate data looks for a sequence
of orthogonal projections such that the projected data look as far from
Gaussian as possible. With pre-whitened data, this amounts to looking for
components that are as independent as possible.

ICA starts from essentially a factor analysis solution, and looks for
rotations that lead to independent components. From this point of view, ICA is
just another factor rotation method, along with the traditional “varimax”
and “quartimax” methods used in psychometrics.

FIGURE 14.39. A comparison of the first five ICA components computed using
FastICA (above diagonal) with the first five PCA components (below diagonal).
Each component is standardized to have unit variance.


Example: Handwritten Digits 

We revisit the handwritten threes analyzed by PCA in Section 14.5.1.
Figure 14.39 compares the first five (standardized) principal components with
the first five ICA components, all shown in the same standardized units.
Note that each plot is a two-dimensional projection from a 256-dimensional
space. While the PCA components all appear to have joint Gaussian
distributions, the ICA components have long-tailed distributions. This is not too
surprising, since PCA focuses on variance, while ICA specifically looks for
non-Gaussian distributions. All the components have been standardized,
so we do not see the decreasing variances of the principal components.

FIGURE 14.40. The highlighted digits from Figure 14.39. By comparing with
the mean digits, we see the nature of the ICA component.

For each ICA component we have highlighted two of the extreme digits, 
as well as a pair of central digits and displayed them in Figure 14.40. 
This illustrates the nature of each of the components. For example, ICA 
component five picks up the long sweeping tailed threes. 


Example: EEG Time Courses 

ICA has become an important tool in the study of brain dynamics—the 
example we present here uses ICA to untangle the components of signals 
in multi-channel electroencephalographic (EEG) data (Onton and Makeig, 
2006). 

Subjects wear a cap embedded with a lattice of 100 EEG electrodes,
which record brain activity at different locations on the scalp. Figure 14.41 11
(top panel) shows 15 seconds of output from a subset of nine of these
electrodes from a subject performing a standard “two-back” learning task over
a 30 minute period. The subject is presented with a letter (B, H, J, C, F, or
K) at roughly 1500-ms intervals, and responds by pressing one of two
buttons to indicate whether the letter presented is the same or different from
that presented two steps back. Depending on the answer, the subject earns
or loses points, and occasionally earns bonus or loses penalty points. The
time-course data show spatial correlation in the EEG signals—the signals 
of nearby sensors look very similar. 

The key assumption here is that signals recorded at each scalp electrode
are a mixture of independent potentials arising from different cortical
activities, as well as non-cortical artifact domains; see the reference for a
detailed overview of ICA in this domain.

11 Reprinted from Progress in Brain Research, Vol. 159, Julie Onton and Scott Makeig,
“Information based modeling of event-related brain dynamics,” Page 106, Copyright
(2006), with permission from Elsevier. We thank Julie Onton and Scott Makeig for
supplying an electronic version of the image.

The lower part of Figure 14.41 shows a selection of ICA components. 
The colored images represent the estimated unmixing coefficient vectors $\hat{a}_j$
as heatmap images superimposed on the scalp, indicating the location of 
activity. The corresponding time-courses show the activity of the learned 
ICA components. 

For example, the subject blinked after each performance feedback signal 
(colored vertical lines), which accounts for the location and artifact signal 
in IC1 and IC3. IC12 is an artifact associated with the cardiac pulse. IC4 
and IC7 account for frontal theta-band activities, and appear after a stretch 
of correct performance. See Onton and Makeig (2006) for a more detailed 
discussion of this example, and the use of ICA in EEG modeling. 

14.7.3 Exploratory Projection Pursuit

Friedman and Tukey (1974) proposed exploratory projection pursuit, a
graphical exploration technique for visualizing high-dimensional data. Their
view was that most low (one- or two-dimensional) projections of
high-dimensional data look Gaussian. Interesting structure, such as clusters or
long tails, would be revealed by non-Gaussian projections. They proposed
a number of projection indices for optimization, each focusing on a
different departure from Gaussianity. Since their initial proposal, a variety of
improvements have been suggested (Huber, 1985; Friedman, 1987), and a
variety of indices, including entropy, are implemented in the interactive
graphics package Xgobi (Swayne et al., 1991, now called GGobi). These
projection indices are exactly of the same form as $J(Y_j)$ above, where
$Y_j = a_j^T X$, a normalized linear combination of the components of X. In
fact, some of the approximations and substitutions for cross-entropy
coincide with indices proposed for projection pursuit. Typically with projection
pursuit, the directions $a_j$ are not constrained to be orthogonal. Friedman
(1987) transforms the data to look Gaussian in the chosen projection, and
then searches for subsequent directions. Despite their different origins, ICA
and exploratory projection pursuit are quite similar, at least in the
representation described here.

14.7.4 A Direct Approach to ICA

Independent components have by definition a joint product density

$$ f_S(s) = \prod_{j=1}^{p} f_j(s_j), \tag{14.88} $$

so here we present an approach that estimates this density directly
using generalized additive models (Section 9.1). Full details can be found in
Hastie and Tibshirani (2003), and the method is implemented in the R
package ProDenICA, available from CRAN.

FIGURE 14.41. Fifteen seconds of EEG data (of 1917 seconds) at nine (of
100) scalp channels (top panel), as well as nine ICA components (lower panel).
While nearby electrodes record nearly identical mixtures of brain and non-brain
activity, ICA components are temporally distinct. The colored scalps represent the
ICA unmixing coefficients $\hat{a}_j$ as a heatmap, showing brain or scalp location of the
source.

In the spirit of representing departures from Gaussianity, we represent
each $f_j$ as

$$ f_j(s_j) = \phi(s_j)\,e^{g_j(s_j)}, \tag{14.89} $$

a tilted Gaussian density. Here $\phi$ is the standard Gaussian density, and
$g_j$ satisfies the normalization conditions required of a density. Assuming
as before that X is pre-whitened, the log-likelihood for the observed data
X = AS is

$$ \ell(A, \{g_j\}_1^p; X) = \sum_{i=1}^{N}\sum_{j=1}^{p} \left[\log\phi(a_j^T x_i) + g_j(a_j^T x_i)\right], \tag{14.90} $$

which we wish to maximize subject to the constraints that A is orthogonal
and that the $g_j$ result in densities in (14.89). Without imposing any further
restrictions on $g_j$, the model (14.90) is over-parametrized, so we instead
maximize a regularized version

$$ \sum_{j=1}^{p}\left[\frac{1}{N}\sum_{i=1}^{N}\left[\log\phi(a_j^T x_i) + g_j(a_j^T x_i)\right] - \int\phi(t)\,e^{g_j(t)}\,dt - \lambda_j\int\{g_j'''(t)\}^2\,dt\right]. \tag{14.91} $$

We have subtracted two penalty terms (for each j) in (14.91), inspired by
Silverman (1986, Section 5.4.4):

• The first enforces the density constraint $\int \phi(t)\,e^{g_j(t)}\,dt = 1$ on any
solution $g_j$.

• The second is a roughness penalty, which guarantees that the solution
$g_j$ is a quartic-spline with knots at the observed values of $s_{ij} = a_j^T x_i$.

It can further be shown that the solution densities $f_j = \phi\,e^{g_j}$ each have
mean zero and variance one (Exercise 14.18). As we increase $\lambda_j$, these
solutions approach the standard Gaussian $\phi$.


Algorithm 14.3 Product Density ICA Algorithm: ProDenICA 

1. Initialize A (random Gaussian matrix followed by orthogonalization). 

2. Alternate until convergence of A: 

(a) Given A, optimize (14.91) w.r.t. gj (separately for each j). 

(b) Given $g_j$, j = 1, ..., p, perform one step of a fixed point
algorithm towards finding the optimal A.


We fit the functions $g_j$ and directions $a_j$ by optimizing (14.91) in an
alternating fashion, as described in Algorithm 14.3.

Step 2(a) amounts to a semi-parametric density estimation, which can
be solved using a novel application of generalized additive models. For
convenience we extract one of the p separate problems,

$$ \frac{1}{N}\sum_{i=1}^{N}\left[\log\phi(s_i) + g(s_i)\right] - \int\phi(t)\,e^{g(t)}\,dt - \lambda\int\{g'''(t)\}^2\,dt. \tag{14.92} $$

Although the second integral in (14.92) leads to a smoothing spline, the
first integral is problematic, and requires an approximation. We construct
a fine grid of L values $s_\ell^*$ in increments $\Delta$ covering the observed values $s_i$,
and count the number of $s_i$ in the resulting bins:

$$ y_\ell^* = \frac{\#\{s_i \in (s_\ell^* - \Delta/2,\; s_\ell^* + \Delta/2)\}}{N}. \tag{14.93} $$

Typically we pick L to be 1000, which is more than adequate. We can then
approximate (14.92) by

$$ \sum_{\ell=1}^{L}\left\{ y_\ell^*\left[\log\phi(s_\ell^*) + g(s_\ell^*)\right] - \Delta\,\phi(s_\ell^*)\,e^{g(s_\ell^*)} \right\} - \lambda\int\{g'''(s)\}^2\,ds. \tag{14.94} $$

This last expression can be seen to be proportional to a penalized Poisson
log-likelihood with response $y_\ell^*/\Delta$ and penalty parameter $\lambda/\Delta$, and mean
$\mu(s) = \phi(s)\,e^{g(s)}$. This is a generalized additive spline model (Hastie and
Tibshirani, 1990; Efron and Tibshirani, 1996), with an offset term $\log\phi(s)$,
and can be fit using a Newton algorithm in O(L) operations. Although
a quartic spline is called for, we find in practice that a cubic spline is
adequate. We have p tuning parameters $\lambda_j$ to set; in practice we make
them all the same, and specify the amount of smoothing via the effective
degrees-of-freedom df(λ). Our software uses 5df as a default value.
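
The binning step behind (14.93) and the first two terms of (14.94) can be sketched in a few lines. This is an illustration only, not the ProDenICA code: the function names are hypothetical, `np.histogram` is used to build L equal-width bins over the observed range (with half-open rather than open intervals), and the roughness penalty is omitted.

```python
import numpy as np

def bin_proportions(s, L=1000):
    """Grid midpoints s*_l, bin width Delta, and y*_l = (bin count) / N."""
    s = np.asarray(s, dtype=float)
    counts, edges = np.histogram(s, bins=L)   # L equal-width bins over [min, max]
    delta = edges[1] - edges[0]
    grid = 0.5 * (edges[:-1] + edges[1:])
    return grid, delta, counts / s.size

def binned_criterion(s, g, L=1000):
    """First two terms of (14.94) for a given function g (penalty omitted)."""
    grid, delta, y = bin_proportions(s, L)
    log_phi = -0.5 * grid**2 - 0.5 * np.log(2.0 * np.pi)
    gv = g(grid)
    return float(np.sum(y * (log_phi + gv) - delta * np.exp(log_phi + gv)))

rng = np.random.default_rng(0)
s = rng.normal(size=2000)

# With g = 0 the unbinned criterion (14.92), without penalty, is
# mean(log phi(s_i)) - 1, since the Gaussian density integrates to one.
value = binned_criterion(s, g=lambda t: np.zeros_like(t))
exact = float(np.mean(-0.5 * s**2 - 0.5 * np.log(2.0 * np.pi))) - 1.0
print(value, exact)
```

The two numbers agree closely, which suggests that a grid of 1000 bins loses very little relative to the unbinned criterion in this setting.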

Step 2(b) in Algorithm 14.3 requires optimizing (14.92) with respect to
A, holding the $g_j$ fixed. Only the first terms in the sum involve A, and
since A is orthogonal, the collection of terms involving $\phi$ do not depend on
A (Exercise 14.19). Hence we need to maximize

$$
\begin{aligned}
C(A) &= \frac{1}{N}\sum_{j=1}^{p}\sum_{i=1}^{N} g_j(a_j^T x_i) \\
     &= \sum_{j=1}^{p} C_j(a_j).
\end{aligned}
\tag{14.95}
$$

C(A) is a log-likelihood ratio between the fitted density and a Gaussian,
and can be seen as an estimate of negentropy (14.86), with each $g_j$ a
contrast function as in (14.87). The fixed point update in step 2(b) is a modified
Newton step (Exercise 14.20):

1. For each j update

$$ a_j \leftarrow \mathrm{E}\left\{X g_j'(a_j^T X) - \mathrm{E}[g_j''(a_j^T X)]\,a_j\right\}, \tag{14.96} $$

where E represents expectation w.r.t. the sample $x_i$. Since $g_j$ is a fitted
quartic (or cubic) spline, the first and second derivatives are readily
available.

2. Orthogonalize A using the symmetric square-root transformation
$(AA^T)^{-1/2}A$. If $A = UDV^T$ is the SVD of A, it is easy to show that
this leads to the update $A \leftarrow UV^T$.
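
The equivalence used in the orthogonalization step is easy to confirm numerically. The sketch below, using an arbitrary random 4 × 4 matrix as a stand-in for A, computes $(AA^T)^{-1/2}A$ through an eigendecomposition of the symmetric matrix $AA^T$ and compares it with $UV^T$ from the SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))          # arbitrary nonsingular matrix (illustrative)

# Route 1: A <- U V^T from the SVD A = U D V^T.
U, d, Vt = np.linalg.svd(A)
A_svd = U @ Vt

# Route 2: (A A^T)^{-1/2} A, with the inverse square root taken through
# the eigendecomposition A A^T = E diag(evals) E^T.
evals, E = np.linalg.eigh(A @ A.T)
inv_sqrt = E @ np.diag(evals ** -0.5) @ E.T
A_sym = inv_sqrt @ A

print(np.allclose(A_sym, A_svd))                 # the two routes agree
print(np.allclose(A_svd @ A_svd.T, np.eye(4)))   # and the result is orthogonal
```

The SVD route avoids forming an explicit inverse square root, which is presumably why the update is stated in that form.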

Our ProDenICA algorithm works as well as FastICA on the artificial time 
series data of Figure 14.37, the mixture of uniforms data of Figure 14.38, 
and the digit data in Figure 14.39. 


Example: Simulations 




FIGURE 14.42. The left panel shows 18 distributions used for comparisons. 
These include the “t”, uniform, exponential, mixtures of exponentials, symmetric 
and asymmetric Gaussian mixtures. The right panel shows (on the log scale) 
the average Amari metric for each method and each distribution, based on 30
simulations in $\mathbb{R}^2$ for each distribution.

Figure 14.42 shows the results of a simulation comparing ProDenICA to
FastICA, and another semi-parametric competitor KernelICA (Bach and
Jordan, 2002). The left panel shows the 18 distributions used as a basis
of comparison. For each distribution, we generated a pair of independent
components (N = 1024), and a random mixing matrix in $\mathbb{R}^2$ with condition
number between 1 and 2. We used our R implementations of FastICA, using
the negentropy criterion (14.87), and ProDenICA. For KernelICA we used
the authors' MATLAB code. 12 Since the search criteria are nonconvex, we
used five random starts for each method. Each of the algorithms delivers
an orthogonal mixing matrix A (the data were pre-whitened), which is
available for comparison with the generating orthogonalized mixing matrix
$A_0$. We used the Amari metric (Bach and Jordan, 2002) as a measure of
the closeness of the two frames:

$$ d(A_0, A) = \frac{1}{2p}\sum_{i=1}^{p}\left(\frac{\sum_{j=1}^{p}|r_{ij}|}{\max_j |r_{ij}|} - 1\right) + \frac{1}{2p}\sum_{j=1}^{p}\left(\frac{\sum_{i=1}^{p}|r_{ij}|}{\max_i |r_{ij}|} - 1\right), \tag{14.97} $$

where $r_{ij} = (A_0 A^{-1})_{ij}$. The right panel in Figure 14.42 compares the
averages (on the log scale) of the Amari metric between the truth and the
estimated mixing matrices. ProDenICA is competitive with FastICA and
KernelICA in all situations, and dominates most of the mixture simulations.
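
For readers who want to compute the Amari metric, here is a minimal sketch following the form given by Bach and Jordan (2002), with $r_{ij} = (A_0 A^{-1})_{ij}$ as in the text. The function name and the small test matrices are assumptions for illustration. The metric is zero exactly when $A_0 A^{-1}$ is a scaled permutation matrix, that is, when the two frames agree up to ordering, sign and scale.

```python
import numpy as np

def amari_metric(A0, A):
    """Amari metric between two p x p matrices, using R = |A0 A^{-1}|."""
    R = np.abs(A0 @ np.linalg.inv(A))
    p = R.shape[0]
    row_term = np.sum(R.sum(axis=1) / R.max(axis=1) - 1.0)
    col_term = np.sum(R.sum(axis=0) / R.max(axis=0) - 1.0)
    return float((row_term + col_term) / (2.0 * p))

A0 = np.array([[2.0, 1.0], [1.0, 3.0]])

# A signed, scaled permutation applied on the left leaves A0 A^{-1}
# a scaled permutation matrix, so the metric is zero.
P = np.array([[0.0, -2.0], [1.0, 0.0]])
print(amari_metric(A0, A0))        # 0.0
print(amari_metric(A0, P @ A0))    # 0.0 up to rounding

# A 45-degree rotation against the identity is the worst case for p = 2.
c = np.cos(np.pi / 4.0)
rot = np.array([[c, -c], [c, c]])
print(amari_metric(np.eye(2), rot))  # 1.0 up to rounding
```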


14.8 Multidimensional Scaling 

Both self-organizing maps and principal curves and surfaces map data 
points in 1R P to a lower-dimensional manifold. Multidimensional scaling 
(MDS) has a similar goal, but approaches the problem in a somewhat dif¬ 
ferent way. 

We start with observations $x_1, x_2, \ldots, x_N \in \mathbb{R}^p$, and let $d_{ij}$ be the
distance between observations $i$ and $j$. Often we choose Euclidean distance
$d_{ij} = \|x_i - x_j\|$, but other distances may be used. Further, in some
applications we may not even have available the data points $x_i$, but only
have some dissimilarity measure $d_{ij}$ (see Section 14.3.10). For example, in
a wine tasting experiment, $d_{ij}$ might be a measure of how different a
subject judged wines $i$ and $j$, and the subject provides such a measure for all
pairs of wines $i, j$. MDS requires only the dissimilarities $d_{ij}$, in contrast to
the SOM and principal curves and surfaces which need the data points $x_i$.

Multidimensional scaling seeks values $z_1, z_2, \ldots, z_N \in \mathbb{R}^k$ to minimize
the so-called stress function¹³

$$
S_M(z_1, z_2, \ldots, z_N) = \sum_{i \neq i'} \left(d_{ii'} - \|z_i - z_{i'}\|\right)^2. \tag{14.98}
$$

¹² Francis Bach kindly supplied this code, and helped us set up the simulations.

¹³ Some authors define stress as the square-root of $S_M$; since it does not affect the
optimization, we leave it squared to make comparisons with other criteria simpler.

This is known as least squares or Kruskal-Shephard scaling. The idea is
to find a lower-dimensional representation of the data that preserves the
pairwise distances as well as possible. Notice that the approximation is
in terms of the distances rather than squared distances (which results in
slightly messier algebra). A gradient descent algorithm is used to minimize
$S_M$.
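
As a rough illustration of least squares scaling, the sketch below evaluates the stress (14.98) and minimizes it by plain gradient descent. It is a minimal sketch under stated assumptions, not a reference implementation: the fixed step size, the number of steps, the random Gaussian start and all function names are choices made here for illustration.

```python
import math
import random


def stress(d, z):
    """Stress S_M: sum over ordered pairs i != j of (d_ij - ||z_i - z_j||)^2."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (d[i][j] - math.dist(z[i], z[j])) ** 2
    return total


def stress_gradient(d, z):
    """Gradient of the stress with respect to every coordinate of every z_i."""
    n, k = len(z), len(z[0])
    grad = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(z[i], z[j])
            if dist == 0.0:
                continue  # gradient is undefined for coincident points; skip the pair
            # Each unordered pair appears twice in the sum, hence the factor 4.
            coef = -4.0 * (d[i][j] - dist) / dist
            for a in range(k):
                grad[i][a] += coef * (z[i][a] - z[j][a])
    return grad


def least_squares_scaling(d, k=2, steps=500, rate=0.01, seed=0):
    """Fixed-step gradient descent on the stress from a random start (assumed settings)."""
    rng = random.Random(seed)
    z = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(len(d))]
    for _ in range(steps):
        g = stress_gradient(d, z)
        z = [[zi[a] - rate * gi[a] for a in range(k)] for zi, gi in zip(z, g)]
    return z
```

Because the stress is nonconvex, a practical implementation would typically try several starting configurations and keep the best one.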

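A variation on least squares scaling is the so-called Sammon mapping
which minimizes

$$
S_{Sm}(z_1, z_2, \ldots, z_N) = \sum_{i \neq i'} \frac{\left(d_{ii'} - \|z_i - z_{i'}\|\right)^2}{d_{ii'}}. \tag{14.99}
$$

Here more emphasis is put on preserving smaller pairwise distances.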

In classical scaling, we instead start with similarities $s_{ii'}$: often we use
the centered inner product $s_{ii'} = \langle x_i - \bar{x}, x_{i'} - \bar{x}\rangle$. The problem then is to
minimize

$$
S_C(z_1, z_2, \ldots, z_N) = \sum_{i, i'} \left(s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z}\rangle\right)^2 \tag{14.100}
$$

over $z_1, z_2, \ldots, z_N \in \mathbb{R}^k$. This is attractive because there is an explicit
solution in terms of eigenvectors: see Exercise 14.11. If we have distances
rather than inner-products, we can convert them to centered inner-products
if the distances are Euclidean;¹⁴ see (18.31) on page 671 in Chapter 18.
If the similarities are in fact centered inner-products, classical scaling is
exactly equivalent to principal components, an inherently linear
dimension-reduction technique. Classical scaling is not equivalent to least squares
scaling; the loss functions are different, and the mapping can be nonlinear.
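
The conversion from Euclidean distances to centered inner products can be sketched as follows. This assumes the standard double-centering identity $\mathbf{S} = -\tfrac{1}{2}\mathbf{J}\mathbf{D}^{(2)}\mathbf{J}$, where $\mathbf{D}^{(2)}$ holds squared distances and $\mathbf{J} = \mathbf{I} - \mathbf{1}\mathbf{1}^T/N$; the function name `double_center` is hypothetical.

```python
def double_center(d):
    """Centered inner products from a symmetric matrix d of Euclidean distances.

    Subtracts row and column means of the squared distances, adds back the
    grand mean, and multiplies by -1/2.
    """
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]  # equals the column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]
```

The eigen-decomposition of the resulting matrix then gives the classical scaling coordinates, as in Exercise 14.11.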

Least squares and classical scaling are referred to as metric scaling
methods, in the sense that the actual dissimilarities or similarities are
approximated. Shephard-Kruskal nonmetric scaling effectively uses only ranks.
Nonmetric scaling seeks to minimize the stress function

$$
S_{NM}(z_1, z_2, \ldots, z_N) = \frac{\sum_{i \neq i'} \left[\|z_i - z_{i'}\| - \theta(d_{ii'})\right]^2}{\sum_{i \neq i'} \|z_i - z_{i'}\|^2} \tag{14.101}
$$

over the $z_i$ and an arbitrary increasing function $\theta$. With $\theta$ fixed, we
minimize over $z_i$ by gradient descent. With the $z_i$ fixed, the method of
isotonic regression is used to find the best monotonic approximation $\theta(d_{ii'})$
to $\|z_i - z_{i'}\|$. These steps are iterated until the solutions stabilize.
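
The isotonic regression step can be illustrated with the pool-adjacent-violators algorithm, a standard way to compute a least squares nondecreasing fit. This is a sketch rather than the routine used by any particular MDS package; the function name is hypothetical.

```python
def isotonic_regression(y):
    """Least squares nondecreasing fit to the sequence y (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted
```

In nonmetric scaling one would sort the pairs by dissimilarity and apply this fit to the current configuration distances taken in that order; the fitted values play the role of $\theta(d_{ii'})$.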

Like the self-organizing map and principal surfaces, multidimensional
scaling represents high-dimensional data in a low-dimensional coordinate
system. Principal surfaces and SOMs go a step further, and approximate
the original data by a low-dimensional manifold, parametrized in the low
dimensional coordinate system. In a principal surface and SOM, points
close together in the original feature space should map close together on
the manifold, but points far apart in feature space might also map close
together. This is less likely in multidimensional scaling since it explicitly
tries to preserve all pairwise distances.

¹⁴ An $N \times N$ distance matrix is Euclidean if the entries represent pairwise Euclidean
distances between $N$ points in some dimensional space.

FIGURE 14.43. First two coordinates for half-sphere data, from classical
multidimensional scaling.

Figure 14.43 shows the first two MDS coordinates from classical scaling 
for the half-sphere example. There is clear separation of the clusters, and 
the tighter nature of the red cluster is apparent. 


14.9 Nonlinear Dimension Reduction and Local 
Multidimensional Scaling 

Several methods have been recently proposed for nonlinear dimension
reduction, similar in spirit to principal surfaces. The idea is that the data lie
close to an intrinsically low-dimensional nonlinear manifold embedded in a
high-dimensional space. These methods can be thought of as “flattening”
the manifold, and hence reducing the data to a set of low-dimensional
coordinates that represent their relative positions in the manifold. They are
useful for problems where signal-to-noise ratio is very high (e.g., physical
systems), and are probably not as useful for observational data with lower
signal-to-noise ratios.

The basic goal is illustrated in the left panel of Figure 14.44. The data
lie near a parabola with substantial curvature. Classical MDS does not
preserve the ordering of the points along the curve, because it judges points
on opposite ends of the curve to be close together. The right panel shows
the results of local multi-dimensional scaling, one of the three methods for
non-linear multi-dimensional scaling that we discuss below. These
methods use only the coordinates of the points in $p$ dimensions, and have no
other information about the manifold. Local MDS has done a good job of
preserving the ordering of the points along the curve.

FIGURE 14.44. The orange points show data lying on a parabola, while the blue
points show multidimensional scaling representations in one dimension. Classical
multidimensional scaling (left panel) does not preserve the ordering of the points
along the curve, because it judges points on opposite ends of the curve to be close
together. In contrast, local multidimensional scaling (right panel) does a good job
of preserving the ordering of the points along the curve.

We now briefly describe three new approaches to nonlinear dimension 
reduction and manifold mapping. 

Isometric feature mapping (ISOMAP) (Tenenbaum et al., 2000)
constructs a graph to approximate the geodesic distance between points along
the manifold. Specifically, for each data point we find its neighbors, that is,
points within some small Euclidean distance of that point. We construct a graph
with an edge between any two neighboring points. The geodesic distance
between any two points is then approximated by the shortest path
between points on the graph. Finally, classical scaling is applied to the graph
distances, to produce a low-dimensional mapping.
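
The graph-distance step of ISOMAP can be sketched as below. This is a minimal illustration, assuming a fixed neighborhood radius and using the Floyd-Warshall all-pairs shortest path algorithm, which is one of several possible choices; the name `geodesic_distances` and the `radius` parameter are hypothetical.

```python
import math


def geodesic_distances(points, radius):
    """Approximate geodesic distances by shortest paths on a neighborhood graph.

    Two points are joined by an edge, weighted by their Euclidean distance,
    when that distance is at most `radius`. Unreachable pairs stay at infinity.
    """
    n = len(points)
    inf = float("inf")
    g = [[inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        for j in range(n):
            dist = math.dist(points[i], points[j])
            if i != j and dist <= radius:
                g[i][j] = dist
    # Floyd-Warshall: allow paths through each intermediate node k in turn.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][k] + g[k][j] < g[i][j]:
                    g[i][j] = g[i][k] + g[k][j]
    return g
```

For points spaced along a half circle, the graph distance between the two end points follows the arc rather than the chord, which is the behavior ISOMAP relies on before classical scaling is applied.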

Local linear embedding (Roweis and Saul, 2000) takes a very different
approach, trying to preserve the local affine structure of the high-dimensional
data. Each data point is approximated by a linear combination of
neighboring points. Then a lower dimensional representation is constructed that
best preserves these local approximations. The details are interesting, so
we give them here.

1. For each data point $x_i$ in $p$ dimensions, we find its $K$-nearest
neighbors $\mathcal{N}(i)$ in Euclidean distance.

2. We approximate each point by an affine mixture of the points in its
neighborhood:

$$
\min_{w_{ik}} \Big\|x_i - \sum_{k \in \mathcal{N}(i)} w_{ik} x_k\Big\|^2 \tag{14.102}
$$

over weights $w_{ik}$ satisfying $w_{ik} = 0,\ k \notin \mathcal{N}(i)$, $\sum_{k=1}^{N} w_{ik} = 1$. $w_{ik}$
is the contribution of point $k$ to the reconstruction of point $i$. Note
that for a hope of a unique solution, we must have $K < p$.

3. Finally, we find points $y_i$ in a space of dimension $d < p$ to minimize

$$
\sum_{i=1}^{N} \Big\|y_i - \sum_{k=1}^{N} w_{ik} y_k\Big\|^2 \tag{14.103}
$$

with $w_{ik}$ fixed.

In step 3, we minimize

$$
\mathrm{tr}[(\mathbf{Y} - \mathbf{W}\mathbf{Y})^T(\mathbf{Y} - \mathbf{W}\mathbf{Y})] = \mathrm{tr}[\mathbf{Y}^T(\mathbf{I} - \mathbf{W})^T(\mathbf{I} - \mathbf{W})\mathbf{Y}] \tag{14.104}
$$

where $\mathbf{W}$ is $N \times N$; $\mathbf{Y}$ is $N \times d$, for some small $d < p$. The solutions $\hat{\mathbf{Y}}$
are the trailing eigenvectors of $\mathbf{M} = (\mathbf{I} - \mathbf{W})^T(\mathbf{I} - \mathbf{W})$. Since $\mathbf{1}$ is a trivial
eigenvector with eigenvalue 0, we discard it and keep the next $d$. This has
the side effect that $\mathbf{1}^T\mathbf{Y} = 0$, and hence the embedding coordinates are
mean centered.
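
Step 2 has a closed-form solution that is easy to sketch. Because the weights sum to one, the residual equals $\sum_k w_{ik}(x_i - x_k)$, so the criterion is $w^T G w$ with local Gram matrix $G_{jl} = (x_i - x_j)^T(x_i - x_l)$, and the minimizer is proportional to $G^{-1}\mathbf{1}$. The code below is an illustration under assumptions: the small ridge term added to $G$ is a common practical device for singular cases and is not part of the description above, and both function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def lle_weights(x_i, neighbors, ridge=1e-3):
    """Affine reconstruction weights of x_i from its neighbors (they sum to one)."""
    diffs = [[a - b for a, b in zip(x_i, nb)] for nb in neighbors]
    k = len(neighbors)
    gram = [[sum(u * v for u, v in zip(diffs[j], diffs[l])) for l in range(k)] for j in range(k)]
    trace = sum(gram[j][j] for j in range(k))
    for j in range(k):
        gram[j][j] += ridge * trace  # assumed regularization, scaled by the trace
    w = solve_linear(gram, [1.0] * k)
    total = sum(w)
    return [v / total for v in w]
```

A point lying midway between its two neighbors receives weights of one half each, as one would expect from the affine reconstruction.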

FIGURE 14.45. Images of faces mapped into the embedding space described by
the first two coordinates of LLE. Next to the circled points, representative faces
are shown in different parts of the space. The images at the bottom of the plot
correspond to points along the top right path (linked by solid line), and illustrate
one particular mode of variability in pose and expression.

Local MDS (Chen and Buja, 2008) takes the simplest and arguably the
most direct approach. We define $\mathcal{N}$ to be the symmetric set of nearby pairs
of points; specifically a pair $(i, i')$ is in $\mathcal{N}$ if point $i$ is among the $K$-nearest
neighbors of $i'$, or vice-versa. Then we construct the stress function

$$
S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}} \left(d_{ii'} - \|z_i - z_{i'}\|\right)^2 + \sum_{(i,i') \notin \mathcal{N}} w \cdot \left(D - \|z_i - z_{i'}\|\right)^2. \tag{14.105}
$$

Here $D$ is some large constant and $w$ is a weight. The idea is that points
that are not neighbors are considered to be very far apart; such pairs are
given a small weight $w$ so that they don’t dominate the overall stress
function. To simplify the expression, we take $w \sim 1/D$, and let $D \to \infty$.
Expanding (14.105), this gives

$$
S_L(z_1, z_2, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}} \left(d_{ii'} - \|z_i - z_{i'}\|\right)^2 - \tau \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\|, \tag{14.106}
$$

where $\tau = 2wD$. The first term in (14.106) tries to preserve local structure
in the data, while the second term encourages the representations $z_i, z_{i'}$
for pairs $(i, i')$ that are non-neighbors to be farther apart. Local MDS
minimizes the stress function (14.106) over $z_i$, for fixed values of the number
of neighbors $K$ and the tuning parameter $\tau$.
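
The local MDS criterion (14.106) is simple to evaluate once the neighbor set is known. The sketch below is illustrative only: it counts each unordered pair once (an assumption about how the sums are indexed), and the function names are hypothetical.

```python
import math


def symmetric_neighbors(d, k):
    """Set of unordered pairs (i, j), i < j, where one point is among the
    k nearest neighbors of the other under the dissimilarity matrix d."""
    n = len(d)
    pairs = set()
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        for j in order[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def local_mds_stress(d, z, pairs, tau):
    """Local stress: squared error on neighbor pairs minus tau times the
    configuration distance on all non-neighbor pairs."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(z[i], z[j])
            if (i, j) in pairs:
                total += (d[i][j] - dist) ** 2
            else:
                total -= tau * dist
    return total
```

The repulsion term is unbounded below on its own, so in practice $\tau$ is kept small relative to the local fitting term.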

The right panel of Figure 14.44 shows the result of local MDS, using $K = 2$
neighbors and $\tau = 0.01$. We used coordinate descent with multiple starting
values to find a good minimum of the (nonconvex) stress function (14.106).
The ordering of the points along the curve has been largely preserved.

Figure 14.45 shows a more interesting application of one of these
methods (LLE).¹⁵ The data consist of 1965 photographs, digitized as 20 × 28
grayscale images. The first two coordinates of LLE are shown
and reveal some variability in pose and expression. Similar pictures were
produced by local MDS.

In experiments reported in Chen and Buja (2008), local MDS shows
superior performance, as compared to ISOMAP and LLE. They also
demonstrate the usefulness of local MDS for graph layout. There are also close
connections between the methods discussed here, spectral clustering
(Section 14.5.3) and kernel PCA (Section 14.5.4).


14.10 The Google PageRank Algorithm 

In this section we give a brief description of the original PageRank
algorithm used by the Google search engine, an interesting recent application
of unsupervised learning methods.

We suppose that we have N web pages and wish to rank them in terms 
of importance. For example, the N pages might all contain a string match 
to “statistical learning” and we might wish to rank the pages in terms of 
their likely relevance to a websurfer. 

The PageRank algorithm considers a webpage to be important if many 
other webpages point to it. However the linking webpages that point to a 
given page are not treated equally: the algorithm also takes into account 
both the importance ( PageRank ) of the linking pages and the number of 
outgoing links that they have. Linking pages with higher PageRank are 
given more weight, while pages with more outgoing links are given less 
weight. These ideas lead to a recursive definition for PageRank , detailed 
next. 


¹⁵ Sam Roweis and Lawrence Saul kindly provided this figure.

Let $L_{ij} = 1$ if page $j$ points to page $i$, and zero otherwise. Let
$c_j = \sum_{i=1}^{N} L_{ij}$ equal the number of pages pointed to by page $j$ (number of
outlinks). Then the Google PageRanks $p_i$ are defined by the recursive
relationship

$$
p_i = (1 - d) + d \sum_{j=1}^{N} \left(\frac{L_{ij}}{c_j}\right) p_j \tag{14.107}
$$

where $d$ is a positive constant (apparently set to 0.85).

The idea is that the importance of page $i$ is the sum of the importances of
pages that point to that page. The sums are weighted by $1/c_j$, that is, each
page distributes a total vote of 1 to other pages. The constant $d$ ensures
that each page gets a PageRank of at least $1 - d$. In matrix notation

$$
\mathbf{p} = (1 - d)\mathbf{e} + d \cdot \mathbf{L}\mathbf{D}_c^{-1}\mathbf{p} \tag{14.108}
$$

where $\mathbf{e}$ is a vector of $N$ ones and $\mathbf{D}_c = \mathrm{diag}(\mathbf{c})$ is a diagonal matrix with
diagonal elements $c_j$. Introducing the normalization $\mathbf{e}^T\mathbf{p} = N$ (i.e., the
average PageRank is 1), we can write (14.108) as

$$
\mathbf{p} = \left[(1 - d)\mathbf{e}\mathbf{e}^T/N + d\,\mathbf{L}\mathbf{D}_c^{-1}\right]\mathbf{p} = \mathbf{A}\mathbf{p} \tag{14.109}
$$

where the matrix $\mathbf{A}$ is the expression in square brackets.

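Exploiting a connection with Markov chains (see below), it can be shown
that the matrix $\mathbf{A}$ has a real eigenvalue equal to one, and one is its largest
eigenvalue. This means that we can find $\hat{\mathbf{p}}$ by the power method: starting
with some $\mathbf{p} = \mathbf{p}_0$ we iterate

$$
\mathbf{p}_k \leftarrow \mathbf{A}\mathbf{p}_{k-1}; \qquad \mathbf{p}_k \leftarrow N\frac{\mathbf{p}_k}{\mathbf{e}^T\mathbf{p}_k}. \tag{14.110}
$$

The fixed points $\hat{\mathbf{p}}$ are the desired PageRanks.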

In the original paper of Page et al. (1998), the authors considered
PageRank as a model of user behavior, where a random web surfer clicks on links
at random, without regard to content. The surfer does a random walk on 
the web, choosing among available outgoing links at random. The factor 
1 — d is the probability that he does not click on a link, but jumps instead 
to a random webpage. 

Some descriptions of PageRank have $(1 - d)/N$ as the first term in
definition (14.107), which would better coincide with the random surfer
interpretation. Then the page rank solution (divided by $N$) is the stationary
distribution of an irreducible, aperiodic Markov chain over the $N$ webpages.

Definition (14.107) also corresponds to an irreducible, aperiodic Markov
chain, with different transition probabilities than those from the $(1 - d)/N$
version. Viewing PageRank as a Markov chain makes clear why the matrix
$\mathbf{A}$ has a maximal real eigenvalue of 1. Since $\mathbf{A}$ has positive entries with
each column summing to one, Markov chain theory tells us that it has a
unique eigenvector with eigenvalue one, corresponding to the stationary
distribution of the chain (Bremaud, 1999).

FIGURE 14.46. PageRank algorithm: example of a small network

A small network is shown for illustration in Figure 14.46. The link matrix
is

$$
\mathbf{L} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{14.111}
$$

and the number of outlinks is $\mathbf{c} = (2, 1, 1, 1)$.

The PageRank solution is $\hat{\mathbf{p}} = (1.49, 0.78, 1.58, 0.15)$. Notice that page 4
has no incoming links, and hence gets the minimum PageRank of 0.15.
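
The power method is short enough to sketch directly. The function below iterates the recursion (14.107) and rescales so that the PageRanks sum to $N$, as in (14.110). It is an illustrative sketch: the function name, the fixed iteration count and the treatment of pages without outlinks (they simply cast no vote) are assumptions made here.

```python
def pagerank(link, d=0.85, iterations=100):
    """PageRank by the power method.

    link[i][j] is 1 if page j points to page i and 0 otherwise.
    Returns PageRanks normalized to sum to the number of pages.
    """
    n = len(link)
    outlinks = [sum(link[i][j] for i in range(n)) for j in range(n)]
    p = [1.0] * n
    for _ in range(iterations):
        new = [
            (1.0 - d)
            + d * sum(link[i][j] / outlinks[j] * p[j] for j in range(n) if outlinks[j] > 0)
            for i in range(n)
        ]
        scale = n / sum(new)  # enforce the normalization e^T p = N
        p = [scale * v for v in new]
    return p
```

Applied to the four-page link matrix of the small example network, with $d = 0.85$, the iteration settles on values that round to (1.49, 0.78, 1.58, 0.15).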
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There are many books on clustering, including Hartigan (1975), Gordon 
(1999) and Kaufman and Rousseeuw (1990). K-means clustering goes back
Applications in engineering, especially in image compression via vector
quantization, can be found in Gersho and Gray (1992). The K-medoid
procedure is described in Kaufman and Rousseeuw (1990). Association rules
are outlined in Agrawal et al. (1995). The self-organizing map was proposed
by Kohonen (1989) and Kohonen (1990); Kohonen et al. (2000) give a more
recent account. Principal components analysis and multidimensional
scaling are described in standard books on multivariate analysis, for example,
Mardia et al. (1979). Buja et al. (2008) have implemented a powerful
environment called GGvis for multidimensional scaling, and the user manual
contains a lucid overview of the subject. Figures 14.17, 14.21 (left panel)
and 14.28 (left panel) were produced in XGobi, a multidimensional data
visualization package by the same authors. GGobi is a more recent
implementation (Cook and Swayne, 2007). Goodall (1991) gives a technical
overview of Procrustes methods in statistics, and Ramsay and Silverman
(1997) discuss the shape registration problem. Principal curves and surfaces
were proposed in Hastie (1984) and Hastie and Stuetzle (1989). The idea of
principal points was formulated in Flury (1990); Tarpey and Flury (1996)
give an exposition of the general concept of self-consistency. An excellent
tutorial on spectral clustering can be found in von Luxburg (2007); this was
the main source for Section 14.5.3. Von Luxburg credits Donath and Hoffman
(1973) and Fiedler (1973) with the earliest work on the subject. A history
of spectral clustering may be found in Spielman and Teng (1996).
Independent component analysis was proposed by Comon (1994), with subsequent
developments by Bell and Sejnowski (1995); our treatment in Section 14.7
is based on Hyvärinen and Oja (2000). Projection pursuit was proposed by
Friedman and Tukey (1974), and is discussed in detail in Huber (1985). A
dynamic projection pursuit algorithm is implemented in GGobi.


Exercises 


Ex. 14.1 Weights for clustering.

Show that weighted Euclidean distance

$$
d_e^{(w)}(x_i, x_{i'}) = \frac{\sum_{l=1}^{p} w_l (x_{il} - x_{i'l})^2}{\sum_{l=1}^{p} w_l}
$$

satisfies

$$
d_e^{(w)}(x_i, x_{i'}) = d_e(z_i, z_{i'}) = \sum_{l=1}^{p} (z_{il} - z_{i'l})^2, \tag{14.112}
$$

where

$$
z_{il} = x_{il} \cdot \left(\frac{w_l}{\sum_{l=1}^{p} w_l}\right)^{1/2}. \tag{14.113}
$$

Thus weighted Euclidean distance based on $x$ is equivalent to unweighted
Euclidean distance based on $z$.


Ex. 14.2 Consider a mixture model density in $p$-dimensional feature space,

$$
g(x) = \sum_{k=1}^{K} \pi_k g_k(x), \tag{14.114}
$$

where $g_k = N(\mu_k, \mathbf{L} \cdot \sigma^2)$ and $\pi_k \geq 0\ \forall k$ with $\sum_k \pi_k = 1$. Here
$\{\mu_k, \pi_k\},\ k = 1, \ldots, K$ and $\sigma^2$ are unknown parameters.

Suppose we have data $x_1, x_2, \ldots, x_N \sim g(x)$ and we wish to fit the
mixture model.

1. Write down the log-likelihood of the data.

2. Derive an EM algorithm for computing the maximum likelihood
estimates (see Section 8.1).

3. Show that if $\sigma$ has a known value in the mixture model and we take
$\sigma \to 0$, then in a sense this EM algorithm coincides with K-means
clustering.

Ex. 14.3 In Section 14.2.6 we discuss the use of CART or PRIM for
constructing generalized association rules. Show that a problem occurs with
either of these methods when we generate the random data from the
product-marginal distribution; i.e., by randomly permuting the values for each of
the variables. Propose ways to overcome this problem.


Ex. 14.4 Cluster the demographic data of Table 14.1 using a classification
tree. Specifically, generate a reference sample of the same size as the
training set, by randomly permuting the values within each feature. Build a
classification tree to the training sample (class 1) and the reference sample
(class 0) and describe the terminal nodes having highest estimated class 1
probability. Compare the results to the PRIM results near Table 14.1 and
also to the results of K-means clustering applied to the same data.


Ex. 14.5 Generate data with three features, with 30 data points in each of
three classes as follows:

$$
\begin{aligned}
\theta_1 &= U(-\pi/8,\ \pi/8) \\
\phi_1 &= U(0,\ 2\pi) \\
x_1 &= \sin(\theta_1)\cos(\phi_1) + W_{11} \\
y_1 &= \sin(\theta_1)\sin(\phi_1) + W_{12} \\
z_1 &= \cos(\theta_1) + W_{13} \\[4pt]
\theta_2 &= U(\pi/2 - \pi/4,\ \pi/2 + \pi/4) \\
\phi_2 &= U(-\pi/4,\ \pi/4) \\
x_2 &= \sin(\theta_2)\cos(\phi_2) + W_{21} \\
y_2 &= \sin(\theta_2)\sin(\phi_2) + W_{22} \\
z_2 &= \cos(\theta_2) + W_{23} \\[4pt]
\theta_3 &= U(\pi/2 - \pi/4,\ \pi/2 + \pi/4) \\
\phi_3 &= U(\pi/2 - \pi/4,\ \pi/2 + \pi/4) \\
x_3 &= \sin(\theta_3)\cos(\phi_3) + W_{31} \\
y_3 &= \sin(\theta_3)\sin(\phi_3) + W_{32} \\
z_3 &= \cos(\theta_3) + W_{33}
\end{aligned}
$$

Here $U(a, b)$ indicates a uniform variate on the range $[a, b]$ and $W_{jk}$ are
independent normal variates with standard deviation 0.6. Hence the data
lie near the surface of a sphere in three clusters centered at (1, 0, 0), (0, 1, 0)
and (0, 0, 1).

Write a program to fit a SOM to these data, using the learning rates
given in the text. Carry out a K-means clustering of the same data, and
compare the results to those in the text.

Ex. 14.6 Write programs to implement K-means clustering and a
self-organizing map (SOM), with the prototype lying on a two-dimensional
grid. Apply them to the columns of the human tumor microarray data,
using K = 2, 5, 10, 20 centroids for both. Demonstrate that as the size of the
SOM neighborhood is taken to be smaller and smaller, the SOM solution
becomes more similar to the K-means solution.

Ex. 14.7 Derive (14.51) and (14.52) in Section 14.5.1. Show that $\hat{\mu}$ is not
unique, and characterize the family of equivalent solutions.

Ex. 14.8 Derive the solution (14.57) to the Procrustes problem (14.56). 
Derive also the solution to the Procrustes problem with scaling (14.58). 

Ex. 14.9 Write an algorithm to solve

$$
\min_{\{\mathbf{R}_\ell\}_1^L,\ \mathbf{M}} \sum_{\ell=1}^{L} \|\mathbf{X}_\ell\mathbf{R}_\ell - \mathbf{M}\|_F^2. \tag{14.115}
$$

Apply it to the three S’s, and compare the results to those shown in
Figure 14.26.

Ex. 14.10 Derive the solution to the affine-invariant average problem (14.60).
Apply it to the three S’s, and compare the results to those computed in
Exercise 14.9.

Ex. 14.11 Classical multidimensional scaling. Let $\mathbf{S}$ be the centered
inner product matrix with elements $\langle x_i - \bar{x}, x_j - \bar{x}\rangle$. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_k$
be the $k$ largest eigenvalues of $\mathbf{S}$, with associated eigenvectors
$\mathbf{E}_k = (\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k)$. Let $\mathbf{D}_k$ be a diagonal matrix with diagonal entries
$\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_k}$. Show that the solutions $z_i$ to the classical scaling problem
(14.100) are the rows of $\mathbf{E}_k\mathbf{D}_k$.

Ex. 14.12 Consider the sparse PCA criterion (14.71).

1. Show that with $\mathbf{\Theta}$ fixed, solving for $\mathbf{V}$ amounts to $K$ separate
elastic-net regression problems, with responses the $K$ elements of $\mathbf{\Theta}^T x_i$.

2. Show that with $\mathbf{V}$ fixed, solving for $\mathbf{\Theta}$ amounts to a reduced-rank
version of the Procrustes problem, which reduces to

$$
\max_{\mathbf{\Theta}} \mathrm{trace}(\mathbf{\Theta}^T\mathbf{M}) \text{ subject to } \mathbf{\Theta}^T\mathbf{\Theta} = \mathbf{I}_K, \tag{14.116}
$$

where $\mathbf{M}$ and $\mathbf{\Theta}$ are both $p \times K$ with $K \leq p$. If $\mathbf{M} = \mathbf{U}\mathbf{D}\mathbf{Q}^T$ is the
SVD of $\mathbf{M}$, show that the optimal $\mathbf{\Theta} = \mathbf{U}\mathbf{Q}^T$.

Ex. 14.13 Generate 200 data points with three features, lying close to a
helix. In detail, define $X_1 = \cos(s) + 0.1 \cdot Z_1$, $X_2 = \sin(s) + 0.1 \cdot Z_2$,
$X_3 = s + 0.1 \cdot Z_3$ where $s$ takes on 200 equally spaced values between 0 and $2\pi$,
and $Z_1, Z_2, Z_3$ are independent and have standard Gaussian distributions.

(a) Fit a principal curve to the data and plot the estimated coordinate
functions. Compare them to the underlying functions $\cos(s)$, $\sin(s)$
and $s$.

(b) Fit a self-organizing map to the same data, and see if you can discover
the helical shape of the original point cloud.

Ex. 14.14 Pre- and post-multiply equation (14.81) by a diagonal matrix 
containing the inverse variances of the Xj. Hence obtain an equivalent 
decomposition for the correlation matrix, in the sense that a simple scaling 
is applied to the matrix A. 

Ex. 14.15 Generate 200 observations of three variates $X_1, X_2, X_3$ according
to

$$
\begin{aligned}
X_1 &\sim Z_1 \\
X_2 &= X_1 + 0.001 \cdot Z_2 \\
X_3 &= 10 \cdot Z_3
\end{aligned} \tag{14.117}
$$

where $Z_1, Z_2, Z_3$ are independent standard normal variates. Compute the
leading principal component and factor analysis directions. Hence show
that the leading principal component aligns itself in the maximal variance
direction $X_3$, while the leading factor essentially ignores the uncorrelated
component $X_3$, and picks up the correlated component $X_2 + X_1$ (Geoffrey
Hinton, personal communication).

Ex. 14.16 Consider the kernel principal component procedure outlined in
Section 14.5.4. Argue that the number $M$ of principal components is equal
to the rank of $\mathbf{K}$, which is the number of non-zero elements in $\mathbf{D}$. Show
that the $m$th component $\mathbf{z}_m$ ($m$th column of $\mathbf{Z}$) can be written (up to
centering) as $z_{im} = \sum_{j=1}^{N} \alpha_{jm} K(x_i, x_j)$, where $\alpha_{jm} = u_{jm}/d_m$. Show that
the mapping of a new observation $x_0$ to the $m$th component is given by
$z_{0m} = \sum_{j=1}^{N} \alpha_{jm} K(x_0, x_j)$.

Ex. 14.17 Show that with $g_1(x) = \sum_{j=1}^{N} c_j K(x, x_j)$, the solution to (14.66)
is given by $\hat{c}_j = u_{j1}/d_1$, where $\mathbf{u}_1$ is the first column of $\mathbf{U}$ in (14.65), and
$d_1$ the first diagonal element of $\mathbf{D}$. Show that the second and subsequent
principal component functions are defined in a similar manner (hint: see
Section 5.8.1.)

Ex. 14.18 Consider the regularized log-likelihood for the density estimation
problem arising in ICA,

$$
\frac{1}{N}\sum_{i=1}^{N}\left[\log\phi(s_i) + g(s_i)\right] - \int \phi(t)e^{g(t)}\,dt - \lambda\int\{g'''(t)\}^2\,dt. \tag{14.118}
$$

The solution $\hat{g}$ is a quartic smoothing spline, and can be written as
$\hat{g}(s) = \hat{q}(s) + \hat{q}_\perp(s)$, where $\hat{q}$ is a quadratic function (in the null space of the
penalty). Let $\hat{q}(s) = \hat{\theta}_0 + \hat{\theta}_1 s + \hat{\theta}_2 s^2$. By examining the stationarity
conditions for $\hat{\theta}_k,\ k = 0, 1, 2$, show that the solution $\hat{f} = \phi e^{\hat{g}}$ is a density, and
has mean zero and variance one. If we used a second-derivative penalty
$\int\{g''(t)\}^2\,dt$ instead, what simple modification could we make to the
problem to maintain the three moment conditions?

Ex. 14.19 If $\mathbf{A}$ is $p \times p$ orthogonal, show that the first term in (14.92) on
page 568

$$
\sum_{j=1}^{p}\sum_{i=1}^{N} \log\phi(a_j^T x_i),
$$

with $a_j$ the $j$th column of $\mathbf{A}$, does not depend on $\mathbf{A}$.

Ex. 14.20 Fixed point algorithm for ICA (Hyvärinen et al., 2001). Consider
maximizing $C(a) = E\{g(a^T X)\}$ with respect to $a$, with $\|a\| = 1$ and
$\mathrm{Cov}(X) = \mathbf{I}$. Use a Lagrange multiplier to enforce the norm constraint,
and write down the first two derivatives of the modified criterion. Use the
approximation

$$
E\{XX^T g''(a^T X)\} \approx E\{XX^T\}E\{g''(a^T X)\}
$$

to show that the Newton update can be written as the fixed-point update
(14.96).

Ex. 14.21 Consider an undirected graph with non-negative edge weights
$w_{ii'}$ and graph Laplacian $\mathbf{L}$. Suppose there are $m$ connected components
$A_1, A_2, \ldots, A_m$ in the graph. Show that there are $m$ eigenvectors of $\mathbf{L}$
corresponding to eigenvalue zero, and the indicator vectors of these components
$I_{A_1}, I_{A_2}, \ldots, I_{A_m}$ span the zero eigenspace.

Ex. 14.22 

(a) Show that definition (14.108) implies that the sum of the PageRanks
$p_i$ is $N$, the number of web pages.

(b) Write a program to compute the PageRank solutions by the power
method using formulation (14.107). Apply it to the network of
Figure 14.47.


FIGURE 14.47. Example of a small network.

Ex. 14.23 Algorithm for non-negative matrix factorization (Wu and Lange,
2007). A function $g(x, y)$ is said to minorize a function $f(x)$ if

$$
g(x, y) \leq f(x), \quad g(x, x) = f(x) \tag{14.119}
$$

for all $x, y$ in the domain. This is useful for maximizing $f(x)$ since it is easy
to show that $f(x)$ is nondecreasing under the update

$$
x^{s+1} = \arg\max_x g(x, x^s). \tag{14.120}
$$

There are analogous definitions for majorization, for minimizing a function
$f(x)$. The resulting algorithms are known as MM algorithms, for
“minorize-maximize” or “majorize-minimize” (Lange, 2004). It also can be shown
that the EM algorithm (8.5) is an example of an MM algorithm: see
Section 8.5.3 and Exercise 8.2 for details.

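(a) Consider maximization of the function $L(\mathbf{W}, \mathbf{H})$ in (14.73), written
here without the matrix notation

$$
L(\mathbf{W}, \mathbf{H}) = \sum_{i=1}^{N}\sum_{j=1}^{p}\left[x_{ij}\log\left(\sum_{k=1}^{r} w_{ik}h_{kj}\right) - \sum_{k=1}^{r} w_{ik}h_{kj}\right].
$$

Using the concavity of $\log(x)$, show that for any set of $r$ values $y_k \geq 0$
and $0 \leq c_k \leq 1$ with $\sum_{k=1}^{r} c_k = 1$,

$$
\log\left(\sum_{k=1}^{r} y_k\right) \geq \sum_{k=1}^{r} c_k \log(y_k/c_k).
$$

Hence

$$
\log\left(\sum_{k=1}^{r} w_{ik}h_{kj}\right) \geq \sum_{k=1}^{r} \frac{a_{ikj}^s}{b_{ij}^s}\log\left(\frac{b_{ij}^s}{a_{ikj}^s}w_{ik}h_{kj}\right),
$$

where

$$
a_{ikj}^s = w_{ik}^s h_{kj}^s \quad \text{and} \quad b_{ij}^s = \sum_{k=1}^{r} w_{ik}^s h_{kj}^s,
$$

and $s$ indicates the current iteration.

(b) Hence show that, ignoring constants, the function

$$
g(\mathbf{W}, \mathbf{H} \mid \mathbf{W}^s, \mathbf{H}^s) = \sum_{i=1}^{N}\sum_{j=1}^{p}\sum_{k=1}^{r} x_{ij}\frac{a_{ikj}^s}{b_{ij}^s}\left(\log w_{ik} + \log h_{kj}\right) - \sum_{i=1}^{N}\sum_{j=1}^{p}\sum_{k=1}^{r} w_{ik}h_{kj}
$$

minorizes $L(\mathbf{W}, \mathbf{H})$.

(c) Set the partial derivatives of $g(\mathbf{W}, \mathbf{H} \mid \mathbf{W}^s, \mathbf{H}^s)$ to zero and hence
derive the updating steps (14.74).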

Ex. 14.24 Consider the non-negative matrix factorization (14.72) in the
rank one case ($r = 1$).

(a) Show that the updates (14.74) reduce to

$$
\begin{aligned}
w_i &\leftarrow w_i \frac{\sum_{j=1}^{p} x_{ij}}{\sum_{j=1}^{p} w_i h_j} \\
h_j &\leftarrow h_j \frac{\sum_{i=1}^{N} x_{ij}}{\sum_{i=1}^{N} w_i h_j}
\end{aligned} \tag{14.121}
$$

where $w_i = w_{i1}$, $h_j = h_{1j}$. This is an example of the iterative
proportional scaling procedure, applied to the independence model for a
two-way contingency table (Fienberg, 1977, for example).

(b) Show that the final iterates have the explicit form 


Wi=C- 




T n T p X - 
Z^2=l l^j = 1 


hk — 


2-jj =1 x ik 

T n T p X 
Z^i=i Zjj=l 


(14.122) 


for any constant c > 0. These are equivalent to the usual row and 
column estimates for a two-way independence model. 


Ex. 14.25 Fit a non-negative matrix factorization model to the collection
of two’s in the digits database. Use 25 basis elements, and compare with a
24-component (plus mean) PCA model. In both cases display the $\mathbf{W}$ and
$\mathbf{H}$ matrices as in Figure 14.33.

15

Random Forests


15.1 Introduction

Bagging or bootstrap aggregation (Section 8.7) is a technique for reducing
the variance of an estimated prediction function. Bagging seems to work
especially well for high-variance, low-bias procedures, such as trees. For
regression, we simply fit the same regression tree many times to
bootstrap-sampled versions of the training data, and average the result. For
classification, a committee of trees each cast a vote for the predicted class.

Boosting in Chapter 10 was initially proposed as a committee method as
well, although unlike bagging, the committee of weak learners evolves over
time, and the members cast a weighted vote. Boosting appears to dominate
bagging on most problems, and became the preferred choice.

Random forests (Breiman, 2001) is a substantial modification of bagging
that builds a large collection of de-correlated trees, and then averages them.
On many problems the performance of random forests is very similar to
boosting, and they are simpler to train and tune. As a consequence, random
forests are popular, and are implemented in a variety of packages.


15.2 Definition of Random Forests 

The essential idea in bagging (Section 8.7) is to average many noisy but
approximately unbiased models, and hence reduce the variance. Trees are
ideal candidates for bagging, since they can capture complex interaction
structures in the data, and if grown sufficiently deep, have relatively low
bias. Since trees are notoriously noisy, they benefit greatly from the
averaging. Moreover, since each tree generated in bagging is identically distributed
(i.d.), the expectation of an average of $B$ such trees is the same as the
expectation of any one of them. This means the bias of bagged trees is the
same as that of the individual trees, and the only hope of improvement is
through variance reduction. This is in contrast to boosting, where the trees
are grown in an adaptive way to remove bias, and hence are not i.d.

Algorithm 15.1 Random Forest for Regression or Classification.

1. For $b = 1$ to $B$:

(a) Draw a bootstrap sample $\mathbf{Z}^*$ of size $N$ from the training data.

(b) Grow a random-forest tree $T_b$ to the bootstrapped data, by
recursively repeating the following steps for each terminal node of
the tree, until the minimum node size $n_{min}$ is reached.

i. Select $m$ variables at random from the $p$ variables.

ii. Pick the best variable/split-point among the $m$.

iii. Split the node into two daughter nodes.

2. Output the ensemble of trees $\{T_b\}_1^B$.

To make a prediction at a new point $x$:

Regression: $\hat{f}_{rf}^B(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$.

Classification: Let $\hat{C}_b(x)$ be the class prediction of the $b$th random-forest
tree. Then $\hat{C}_{rf}^B(x) = \text{majority vote}\ \{\hat{C}_b(x)\}_1^B$.

An average of B i.i.d. random variables, each with variance er 2 , has vari¬ 
ance jjcr 2 . If the variables are simply i.d. (identically distributed, but not 
necessarily independent) with positive pairwise correlation p, the variance 
of the average is (Exercise 15.1) 

per 2 + 1 -^cr 2 . (15.1) 

As B increases, the second term disappears, but the first remains, and 
hence the size of the correlation of pairs of bagged trees limits the benefits 
of averaging. The idea in random forests (Algorithm 15.1) is to improve 
the variance reduction of bagging by reducing the correlation between the 
trees, without increasing the variance too much. This is achieved in the 
tree-growing process through random selection of the input variables. 
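
To see how the correlation limits the benefit of averaging, the variance of an average of B identically distributed variables with pairwise correlation ρ can be evaluated directly. The sketch below is illustrative only; the function name and the example values (ρ = 0.05, σ² = 1) are assumptions chosen for the demonstration.

```python
def variance_of_average(rho, sigma2, B):
    """Variance of the mean of B i.d. variables with variance sigma2
    and pairwise correlation rho: rho*sigma2 + (1 - rho)/B * sigma2."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2


if __name__ == "__main__":
    for B in (1, 10, 100, 1000):
        # The second term shrinks with B; the floor rho * sigma2 remains.
        print(B, variance_of_average(rho=0.05, sigma2=1.0, B=B))
```

With ρ = 0 the expression reduces to σ²/B, the i.i.d. case; with ρ > 0 the variance never falls below ρσ², however many trees are averaged.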

Specifically, when growing a tree on a bootstrapped dataset: 

Before each split, select $m \le p$ of the input variables at random
as candidates for splitting.








Typical values for m are $\sqrt{p}$ or even as low as 1.

After B such trees $\{T(x;\Theta_b)\}_1^B$ are grown, the random forest (regression)
predictor is

$$\hat{f}_{rf}^B(x) = \frac{1}{B}\sum_{b=1}^{B} T(x;\Theta_b). \qquad (15.2)$$

As in Section 10.9 (page 356), $\Theta_b$ characterizes the $b$th random forest tree in
terms of split variables, cutpoints at each node, and terminal-node values.
Intuitively, reducing m will reduce the correlation between any pair of trees 
in the ensemble, and hence by (15.1) reduce the variance of the average. 


[Figure 15.1. Title: Spam Data.]



FIGURE 15.1. Bagging, random forest, and gradient boosting, applied to the
spam data. For boosting, 5-node trees were used, and the number of trees was
chosen by 10-fold cross-validation (2500 trees). Each “step” in the figure
corresponds to a change in a single misclassification (in a test set of 1536).


Not all estimators can be improved by shaking up the data like this. 
It seems that highly nonlinear estimators, such as trees, benefit the most. 
For bootstrapped trees, $\rho$ is typically small (0.05 or lower is typical; see
Figure 15.9), while $\sigma^2$ is not much larger than the variance for the original
tree. On the other hand, bagging does not change linear estimates, such 
as the sample mean (hence its variance either); the pairwise correlation 
between bootstrapped means is about 50% (Exercise 15.4). 
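
The 50% figure for bootstrapped means can be checked with a small Monte Carlo experiment. The sketch below is a hedged illustration, not part of the original analysis: the sample size, number of replications, seed, and function names are all assumptions. A standard variance calculation (the subject of Exercise 15.4) gives a correlation of N/(2N-1) for samples of size N, which is what the simulation is compared against.

```python
import random


def bootstrap_mean(sample, rng):
    """Mean of one bootstrap resample (drawn with replacement) of the sample."""
    n = len(sample)
    return sum(sample[rng.randrange(n)] for _ in range(n)) / n


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u)
    sv = sum((b - mv) ** 2 for b in v)
    return cov / (su * sv) ** 0.5


def simulate_correlation(n=20, reps=4000, seed=0):
    """Draw many training samples; for each, compute two bootstrap means
    and return the correlation between the two across samples."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        first.append(bootstrap_mean(sample, rng))
        second.append(bootstrap_mean(sample, rng))
    return pearson(first, second)


if __name__ == "__main__":
    n = 20
    print("simulated:", simulate_correlation(n=n))
    print("theory   :", n / (2 * n - 1))
```

Note that the correlation is taken over repeated draws of the training sample, not over bootstrap replicates of one fixed sample.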
Random forests are popular. Leo Breiman’s 1 collaborator Adele Cutler
maintains a random forest website 2 where the software is freely available, 
with more than 3000 downloads reported by 2002. There is a randomForest 
package in R, maintained by Andy Liaw, available from the CRAN website. 

The authors make grand claims about the success of random forests:
“most accurate,” “most interpretable,” and the like. In our experience random
forests do remarkably well, with very little tuning required. A random
forest classifier achieves 4.88% misclassification error on the spam test
data, which compares well with all other methods, and is not significantly
worse than gradient boosting at 4.5%. Bagging achieves 5.4%, which is
significantly worse than either (using the McNemar test outlined in
Exercise 10.6), so it appears that on this example the additional randomization
helps.


[Figure 15.2. Title: Nested Spheres. Methods compared: RF-1, RF-3, Bagging, GBM-1, GBM-6.]

FIGURE 15.2. The results of 50 simulations from the “nested spheres” model in
$\mathbb{R}^{10}$. The Bayes decision boundary is the surface of a sphere (additive). “RF-3”
refers to a random forest with m = 3, and “GBM-6” a gradient boosted model
with interaction order six; similarly for “RF-1” and “GBM-1.” The training sets
were of size 2000, and the test sets 10,000.


Figure 15.1 shows the test-error progression on 2500 trees for the three 
methods. In this case there is some evidence that gradient boosting has 
started to overfit, although 10-fold cross-validation chose all 2500 trees. 


1 Sadly, Leo Breiman died in July, 2005. 

2 http://www.math.usu.edu/~adele/forests/
[Figure 15.3. Title: California Housing Data.]



FIGURE 15.3. Random forests compared to gradient boosting on the California 
housing data. The curves represent mean absolute error on the test data as a 
function of the number of trees in the models. Two random forests are shown, with 
m = 2 and m = 6. The two gradient boosted models use a shrinkage parameter
$\nu = 0.05$ in (10.41), and have interaction depths of 4 and 6. The boosted models
outperform random forests. 

Figure 15.2 shows the results of a simulation 3 comparing random forests 
to gradient boosting on the nested spheres problem [Equation (10.2) in 
Chapter 10]. Boosting easily outperforms random forests here. Notice that 
smaller m is better here, although part of the reason could be that the true 
decision boundary is additive. 

Figure 15.3 compares random forests to boosting (with shrinkage) in a 
regression problem, using the California housing data (Section 10.14.1). 
Two strong features that emerge are 

• Random forests stabilize at about 200 trees, while at 1000 trees boosting
continues to improve. Boosting is slowed down by the shrinkage,
as well as the fact that the trees are much smaller.

• Boosting outperforms random forests here. At 1000 terms, the weaker
boosting model (GBM depth 4) has a smaller error than the stronger
random forest (RF m = 6); a Wilcoxon test on the mean differences
in absolute errors has a p-value of 0.007. For larger m the random
forests performed no better.


3 Details: The random forests were fit using the R package randomForest 4.5-11,
with 500 trees. The gradient boosting models were fit using R package gbm 1.5, with
shrinkage parameter set to 0.05, and 2000 trees.

FIGURE 15.4. OOB error computed on the spam training data, compared to the
test error computed on the test set.


15.3 Details of Random Forests 

We have glossed over the distinction between random forests for classification
versus regression. When used for classification, a random forest obtains
a class vote from each tree, and then classifies using majority vote (see
Section 8.7 on bagging for a similar discussion). When used for regression, the
predictions from each tree at a target point x are simply averaged, as in 
(15.2). In addition, the inventors make the following recommendations: 

• For classification, the default value for m is $\lfloor \sqrt{p} \rfloor$ and the minimum
node size is one.

• For regression, the default value for m is $\lfloor p/3 \rfloor$ and the minimum
node size is five.

In practice the best values for these parameters will depend on the problem,
and they should be treated as tuning parameters. In Figure 15.3 m = 6
performs much better than the default value $\lfloor 8/3 \rfloor = 2$.
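
The recommended defaults are simple enough to write down as code. The following is a small sketch with hypothetical function names; it only restates the two rules above and is not taken from any particular software package.

```python
import math


def default_m(p, task):
    """Default number of candidate variables per split, given p inputs."""
    if task == "classification":
        return math.isqrt(p)  # floor(sqrt(p))
    if task == "regression":
        return p // 3  # floor(p / 3)
    raise ValueError("task must be 'classification' or 'regression'")


def default_min_node_size(task):
    """Default minimum node size: one for classification, five for regression."""
    if task == "classification":
        return 1
    if task == "regression":
        return 5
    raise ValueError("task must be 'classification' or 'regression'")


if __name__ == "__main__":
    # The California housing example has p = 8 predictors.
    print(default_m(8, "regression"), default_min_node_size("regression"))
```

For very small p the floor can reach zero, so an implementation would need a lower bound of one; that guard is left out here to mirror the stated rules exactly.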

15.3.1 Out of Bag Samples 

An important feature of random forests is their use of out-of-bag (OOB) samples:

For each observation $z_i = (x_i, y_i)$, construct its random forest
predictor by averaging only those trees corresponding to bootstrap
samples in which $z_i$ did not appear.

An OOB error estimate is almost identical to that obtained by N-fold
cross-validation; see Exercise 15.2. Hence unlike many other nonlinear estimators,
random forests can be fit in one sequence, with cross-validation being performed
along the way. Once the OOB error stabilizes, the training can be
terminated.

Figure 15.4 shows the OOB misclassification error for the spam data, compared
to the test error. Although 2500 trees are averaged here, it appears
from the plot that about 200 would be sufficient. 
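
The out-of-bag construction can be sketched for the regression case as follows. This is a minimal illustration under stated assumptions: the trees are taken as already fitted, `in_bag[b]` is a hypothetical set of training indices drawn into bootstrap sample b, and `tree_preds[b][i]` is tree b's prediction for training observation i.

```python
def oob_predictions(in_bag, tree_preds):
    """For each training observation, average only the trees whose
    bootstrap sample did not contain that observation."""
    n = len(tree_preds[0])
    result = []
    for i in range(n):
        values = [
            preds[i]
            for bag, preds in zip(in_bag, tree_preds)
            if i not in bag
        ]
        # An observation that appears in every bootstrap sample has no OOB trees.
        result.append(sum(values) / len(values) if values else None)
    return result


def oob_mse(y, oob):
    """Mean squared error over observations that have an OOB prediction."""
    pairs = [(yi, fi) for yi, fi in zip(y, oob) if fi is not None]
    return sum((yi - fi) ** 2 for yi, fi in pairs) / len(pairs)


if __name__ == "__main__":
    in_bag = [{0, 1}, {1, 2}, {0, 2}]
    tree_preds = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    oob = oob_predictions(in_bag, tree_preds)
    print(oob, oob_mse([4.0, 7.0, 3.0], oob))
```

Tracking this error as trees are added gives the running estimate described above, without a separate cross-validation loop.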


15.3.2 Variable Importance 

Variable importance plots can be constructed for random forests in exactly 
the same way as they were for gradient-boosted models (Section 10.13). 
At each split in each tree, the improvement in the split-criterion is the 
importance measure attributed to the splitting variable, and is accumulated 
over all the trees in the forest separately for each variable. The left plot 
of Figure 15.5 shows the variable importances computed in this way for 
the spam data; compare with the corresponding Figure 10.6 on page 354 for 
gradient boosting. Boosting ignores some variables completely, while the 
random forest does not. The candidate split-variable selection increases 
the chance that any single variable gets included in a random forest, while 
no such selection occurs with boosting. 

Random forests also use the OOB samples to construct a different variable- 
importance measure, apparently to measure the prediction strength of each 
variable. When the $b$th tree is grown, the OOB samples are passed down
the tree, and the prediction accuracy is recorded. Then the values for the
$j$th variable are randomly permuted in the OOB samples, and the accuracy
is again computed. The decrease in accuracy as a result of this permuting
is averaged over all trees, and is used as a measure of the importance of
variable j in the random forest. These are expressed as a percent of the
maximum in the right plot in Figure 15.5. Although the rankings of the
two methods are similar, the importances in the right plot are more uniform
over the variables. The randomization effectively voids the effect of
a variable, much like setting a coefficient to zero in a linear model
(Exercise 15.7). This does not measure the effect on prediction were this variable
not available, because if the model was refitted without the variable, other 
variables could be used as surrogates. 
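
The permutation idea can be illustrated on a single fitted predictor. The sketch below is a simplification and rests on assumptions: it uses one generic `predict` function and one evaluation set, whereas a random forest applies the same step tree by tree on each tree's OOB samples and averages the decreases. All names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, j, rng):
    """Decrease in accuracy after randomly permuting the values of variable j."""
    baseline = accuracy(predict, X, y)
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return baseline - accuracy(predict, X_perm, y)


if __name__ == "__main__":
    # Toy predictor that uses only the first variable.
    predict = lambda row: int(row[0] > 0.5)
    X = [[0.1, 5.0], [0.9, 6.0], [0.2, 7.0], [0.8, 8.0]]
    y = [0, 1, 0, 1]
    rng = random.Random(1)
    print(permutation_importance(predict, X, y, 0, rng))
    print(permutation_importance(predict, X, y, 1, rng))
```

A variable the predictor never uses gets importance exactly zero, since permuting it cannot change any prediction.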
[Figure 15.5. Two panels, each titled Variable Importance.]


FIGURE 15.5. Variable importance plots for a classification random forest 
grown on the spam data. The left plot bases the importance on the Gini splitting
index, as in gradient boosting. The rankings compare well with the rankings
produced by gradient boosting (Figure 10.6 on page 354). The right plot uses OOB
randomization to compute variable importances, and tends to spread the importances
more uniformly.
[Figure 15.6. Left panel: Proximity Plot. Right panel: Random Forest Classifier.]



FIGURE 15.6. (Left): Proximity plot for a random forest classifier grown to 
the mixture data. (Right): Decision boundary and training data for random forest 
on mixture data. Six points have been identified in each plot. 


15.3.3 Proximity Plots 

One of the advertised outputs of a random forest is a proximity plot.
Figure 15.6 shows a proximity plot for the mixture data defined in Section 2.3.3
in Chapter 2. In growing a random forest, an $N \times N$ proximity matrix is
accumulated for the training data. For every tree, any pair of OOB observations
sharing a terminal node has their proximity increased by one. This
proximity matrix is then represented in two dimensions using multidimensional
scaling (Section 14.8). The idea is that even though the data may be
high-dimensional, involving mixed variables, etc., the proximity plot gives 
an indication of which observations are effectively close together in the eyes 
of the random forest classifier. 
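
The accumulation of the proximity matrix can be sketched as below. The inputs are assumptions made for illustration: `leaf_ids[b][i]` is the terminal node reached by training observation i in tree b, and `oob[b]` is the set of observations that are out-of-bag for tree b. The multidimensional scaling step is not shown.

```python
def proximity_matrix(leaf_ids, oob):
    """Count, over all trees, how often each pair of OOB observations
    lands in the same terminal node."""
    n = len(leaf_ids[0])
    prox = [[0] * n for _ in range(n)]
    for leaves, out in zip(leaf_ids, oob):
        members = sorted(out)
        for a in range(len(members)):
            for c in range(a + 1, len(members)):
                i, k = members[a], members[c]
                if leaves[i] == leaves[k]:
                    prox[i][k] += 1
                    prox[k][i] += 1
    return prox


if __name__ == "__main__":
    leaf_ids = [[0, 0, 1], [2, 2, 2]]
    oob = [{0, 1, 2}, {0, 2}]
    for row in proximity_matrix(leaf_ids, oob):
        print(row)
```

The result is symmetric with a zero diagonal in this sketch; an implementation may normalize the counts before scaling.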

Proximity plots for random forests often look very similar, irrespective of 
the data, which casts doubt on their utility. They tend to have a star shape, 
one arm per class, which is more pronounced the better the classification 
performance. 

Since the mixture data are two-dimensional, we can map points from the 
proximity plot to the original coordinates, and get a better understanding of 
what they represent. It seems that points in pure regions class-wise map to 
the extremities of the star, while points nearer the decision boundaries map 
nearer the center. This is not surprising when we consider the construction 
of the proximity matrices. Neighboring points in pure regions will often 
end up sharing a bucket, since when a terminal node is pure, it is no longer
split by a random forest tree-growing algorithm. On the other hand, pairs
of points that are close but belong to different classes will sometimes share 
a terminal node, but not always. 


15.3.4 Random Forests and Overfitting 

When the number of variables is large, but the fraction of relevant variables 
small, random forests are likely to perform poorly with small m. At each 
split the chance can be small that the relevant variables will be selected. 
Figure 15.7 shows the results of a simulation that supports this claim. Details
are given in the figure caption and Exercise 15.3. At the top of each
pair we see the hyper-geometric probability that a relevant variable will be
selected at any split by a random forest tree (in this simulation, the relevant
variables are all equal in stature). As this probability gets small, the gap
between boosting and random forests increases. When the number of relevant
variables increases, the performance of random forests is surprisingly
robust to an increase in the number of noise variables. For example, with 6
relevant and 100 noise variables, the probability of a relevant variable being
selected at any split is 0.46, assuming $m = \sqrt{6 + 100} \approx 10$. According to
Figure 15.7, this does not hurt the performance of random forests compared 
with boosting. This robustness is largely due to the relative insensitivity of 
misclassification cost to the bias and variance of the probability estimates 
in each tree. We consider random forests for regression in the next section. 
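
The hyper-geometric probability quoted above is easy to reproduce. In this sketch (the function name is hypothetical) the probability that at least one relevant variable is among the m candidates is one minus the probability that all m candidates are noise variables.

```python
from math import comb


def prob_relevant_selected(n_relevant, n_noise, m):
    """P(at least one relevant variable among m drawn without replacement
    from n_relevant + n_noise variables)."""
    p = n_relevant + n_noise
    return 1.0 - comb(n_noise, m) / comb(p, m)


if __name__ == "__main__":
    # The example in the text: 6 relevant, 100 noise, m = 10.
    print(round(prob_relevant_selected(6, 100, 10), 2))
    # Two relevant variables with increasing noise, m = floor(sqrt(p)).
    for noise in (3, 5, 25, 50, 150):
        p = 2 + noise
        m = int(p ** 0.5)
        print(noise, round(prob_relevant_selected(2, noise, m), 2))
```

The second loop uses noise counts chosen only for illustration; they are not claimed to match the settings of Figure 15.7.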

Another claim is that random forests “cannot overfit” the data. It is 
certainly true that increasing B does not cause the random forest sequence 
to overfit; like bagging, the random forest estimate (15.2) approximates the 
expectation 

$$\hat{f}_{rf}(x) = E_{\Theta}\,T(x;\Theta) = \lim_{B\to\infty} \hat{f}(x)_{rf}^{B} \qquad (15.3)$$

with an average over B realizations of $\Theta$. The distribution of $\Theta$ here is
conditional on the training data. However, this limit can overfit the data; the
average of fully grown trees can result in too rich a model, and incur unnecessary
variance. Segal (2004) demonstrates small gains in performance by
controlling the depths of the individual trees grown in random forests. Our 
experience is that using full-grown trees seldom costs much, and results in 
one less tuning parameter. 

Figure 15.8 shows the modest effect of depth control in a simple regression 
example. Classifiers are less sensitive to variance, and this effect of overfitting
is seldom seen with random-forest classification.


15.4 Analysis of Random Forests 
FIGURE 15.7. A comparison of random forests and gradient boosting on problems
with increasing numbers of noise variables. In each case the true decision
boundary depends on two variables, and an increasing number of noise variables
are included. Random forests uses its default value $m = \sqrt{p}$. At the top of each
pair is the probability that one of the relevant variables is chosen at any split.
The results are based on 50 simulations for each pair, with a training sample of 
300, and a test sample of 500. 


In this section we analyze the mechanisms at play with the additional 
randomization employed by random forests. For this discussion we focus 
on regression and squared error loss, since this gets at the main points, 
and bias and variance are more complex with 0-1 loss (see Section 7.3.1). 
Furthermore, even in the case of a classification problem, we can consider 
the random-forest average as an estimate of the class posterior probabilities, 
for which bias and variance are appropriate descriptors. 



15.4.1 Variance and the De-Correlation Effect

FIGURE 15.8. The effect of tree size on the error in random forest regression.
In this example, the true surface was additive in two of the 12 variables,
plus additive unit-variance Gaussian noise. Tree depth is controlled here by the
minimum node size; the smaller the minimum node size, the deeper the trees.

The limiting form ($B \to \infty$) of the random forest regression estimator is

$$\hat{f}_{rf}(x) = E_{\Theta|Z}\,T(x;\Theta(Z)), \qquad (15.4)$$

where we have made explicit the dependence on the training data Z. Here
we consider estimation at a single target point x. From (15.1) we see that

$$\mathrm{Var}\,\hat{f}_{rf}(x) = \rho(x)\sigma^2(x). \qquad (15.5)$$


Here 

• $\rho(x)$ is the sampling correlation between any pair of trees used in the
averaging:

$$\rho(x) = \mathrm{corr}[T(x;\Theta_1(Z)),\, T(x;\Theta_2(Z))], \qquad (15.6)$$

where $\Theta_1(Z)$ and $\Theta_2(Z)$ are a randomly drawn pair of random forest
trees grown to the randomly sampled Z;

• $\sigma^2(x)$ is the sampling variance of any single randomly drawn tree,

$$\sigma^2(x) = \mathrm{Var}\,T(x;\Theta(Z)). \qquad (15.7)$$

It is easy to confuse $\rho(x)$ with the average correlation between fitted trees
in a given random-forest ensemble; that is, think of the fitted trees as
N-vectors, and compute the average pairwise correlation between these vectors,
conditioned on the data. This is not the case; this conditional correlation
is not directly relevant in the averaging process, and the dependence
on x in $\rho(x)$ warns us of the distinction. Rather, $\rho(x)$ is the theoretical
correlation between a pair of random-forest trees evaluated at x, induced
by repeatedly making training sample draws Z from the population, and
then drawing a pair of random forest trees. In statistical jargon, this is the
correlation induced by the sampling distribution of Z and $\Theta$.

More precisely, the variability averaged over in the calculations in (15.6)
and (15.7) is both

• conditional on Z: due to the bootstrap sampling and feature sampling
at each split, and

• a result of the sampling variability of Z itself.

In fact, the conditional covariance of a pair of tree fits at x is zero, because
the bootstrap and feature sampling is i.i.d.; see Exercise 15.5.


FIGURE 15.9. Correlations between pairs of trees drawn by a random-forest
regression algorithm, as a function of m. The boxplots represent the correlations 
at 600 randomly chosen prediction points x. 


The following demonstrations are based on a simulation model

$$Y = \frac{1}{\sqrt{50}} \sum_{j=1}^{50} X_j + \varepsilon, \qquad (15.8)$$

with all the $X_j$ and $\varepsilon$ iid Gaussian. We use 500 training sets of size 100, and
a single set of test locations of size 600. Since regression trees are nonlinear 
in Z, the patterns we see below will differ somewhat depending on the 
structure of the model. 

Figure 15.9 shows how the correlation (15.6) between pairs of trees decreases
as m decreases: pairs of tree predictions at x for different training
sets Z are likely to be less similar if they do not use the same splitting 
variables. 

In the left panel of Figure 15.10 we consider the variances of single tree
predictors, $\mathrm{Var}\,T(x;\Theta(Z))$ (averaged over 600 prediction points x drawn
randomly from our simulation model). This is the total variance, and can be
decomposed into two parts using standard conditional variance arguments
(see Exercise 15.5):

$$\mathrm{Var}_{\Theta,Z}\,T(x;\Theta(Z)) = \mathrm{Var}_Z\,E_{\Theta|Z}\,T(x;\Theta(Z)) + E_Z\,\mathrm{Var}_{\Theta|Z}\,T(x;\Theta(Z))$$
$$\text{Total Variance} = \mathrm{Var}_Z\,\hat{f}_{rf}(x) + \text{within-}Z\ \text{Variance} \qquad (15.9)$$

The second term is the within-Z variance, a result of the randomization,
which increases as m decreases. The first term is in fact the sampling variance
of the random forest ensemble (shown in the right panel), which decreases
as m decreases. The variance of the individual trees does not change
appreciably over much of the range of m, hence in light of (15.5), the variance
of the ensemble is dramatically lower than this tree variance.
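
The decomposition (15.9) can be verified numerically on a small table of tree predictions. The sketch below is illustrative and rests on assumptions: `preds[z][t]` is a hypothetical prediction at a fixed point x from the t-th randomization on the z-th training set, every row has the same length, and all cells are treated as equally likely. The ratio of the between-Z part to the total is the correlation formula of Exercise 15.5.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divide by the number of values)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def decompose(preds):
    """Return (total, between, within) for a table preds[z][t]."""
    flat = [v for row in preds for v in row]
    total = variance(flat)
    between = variance([mean(row) for row in preds])  # Var_Z of E_{Theta|Z}
    within = mean([variance(row) for row in preds])   # E_Z of Var_{Theta|Z}
    return total, between, within


if __name__ == "__main__":
    preds = [[1.0, 3.0], [5.0, 7.0]]
    total, between, within = decompose(preds)
    print(total, between, within)
    print("rho(x) =", between / total)
```

Reducing m inflates the within-Z part, which lowers the ratio and hence the correlation between trees.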


[Figure 15.10. Left panel: Single Tree. Right panel: Random Forest Ensemble. Axis labels: Mean Squared Error, Variance.]
FIGURE 15.10. Simulation results. The left panel shows the average variance of
a single random forest tree, as a function of m. “Within Z” refers to the average
within-sample contribution to the variance, resulting from the bootstrap sampling
and split-variable sampling (15.9). “Total” includes the sampling variability of
Z. The horizontal line is the average variance of a single fully grown tree (without
bootstrap sampling). The right panel shows the average mean-squared error,
squared bias and variance of the ensemble, as a function of m. Note that the 
variance axis is on the right (same scale, different level). The horizontal line is 
the average squared-bias of a fully grown tree. 


15.4.2 Bias 

As in bagging, the bias of a random forest is the same as the bias of any
of the individual sampled trees $T(x;\Theta(Z))$:

$$\mathrm{Bias}(x) = \mu(x) - E_Z\,\hat{f}_{rf}(x) = \mu(x) - E_Z\,E_{\Theta|Z}\,T(x;\Theta(Z)). \qquad (15.10)$$

This is also typically greater (in absolute terms) than the bias of an unpruned
tree grown to Z, since the randomization and reduced sample space
impose restrictions. Hence the improvements in prediction obtained by bagging
or random forests are solely a result of variance reduction.

Any discussion of bias depends on the unknown true function.
Figure 15.10 (right panel) shows the squared bias for our additive model simulation
(estimated from the 500 realizations). Although for different models
the shape and rate of the bias curves may differ, the general trend is that
as m decreases, the bias increases. Shown in the figure is the mean-squared
error, and we see a classical bias-variance trade-off in the choice of m. For
all m the squared bias of the random forest is greater than that for a single
tree (horizontal line).

These patterns suggest a similarity with ridge regression (Section 3.4.1). 
Ridge regression is useful (in linear models) when one has a large number 
of variables with similarly sized coefficients; ridge shrinks their coefficients 
toward zero, and those of strongly correlated variables toward each other. 
Although the size of the training sample might not permit all the variables 
to be in the model, this regularization via ridge stabilizes the model and allows
all the variables to have their say (albeit diminished). Random forests
with small m perform a similar averaging. Each of the relevant variables
gets its turn to be the primary split, and the ensemble averaging reduces
the contribution of any individual variable. Since this simulation example
(15.8) is based on a linear model in all the variables, ridge regression
achieves a lower mean-squared error (about 0.45 with $\mathrm{df}(\lambda_{opt}) \approx 29$).

15.4.3 Adaptive Nearest Neighbors

The random forest classifier has much in common with the k-nearest neighbor
classifier (Section 13.3); in fact a weighted version thereof. Since each
tree is grown to maximal size, for a particular $\Theta^*$, $T(x;\Theta^*(Z))$ is the
response value for one of the training samples (see footnote 4). The tree-growing algorithm
finds an “optimal” path to that observation, choosing the most informative
predictors from those at its disposal. The averaging process assigns weights
to these training responses, which ultimately vote for the prediction. Hence
via the random-forest voting mechanism, those observations close to the
target point get assigned weights (an equivalent kernel), which combine to
form the classification decision.
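
The equivalent-kernel view can be made concrete with a short sketch. It assumes fully grown trees with one training observation per terminal node, so that each tree's prediction at x is the response of a single training sample; `leaf_mates[b]` is a hypothetical record of which training index that is for tree b.

```python
from collections import Counter


def forest_weights(leaf_mates, n_train):
    """Weight of each training observation in the forest prediction at x:
    the fraction of trees in which it shares the terminal node with x."""
    counts = Counter(leaf_mates)
    n_trees = len(leaf_mates)
    return [counts.get(i, 0) / n_trees for i in range(n_train)]


def weighted_prediction(weights, y):
    """Forest average written as a weighted sum of training responses."""
    return sum(w * yi for w, yi in zip(weights, y))


if __name__ == "__main__":
    leaf_mates = [0, 2, 2, 1]  # four trees, four training observations
    w = forest_weights(leaf_mates, n_train=4)
    print(w)
    print(weighted_prediction(w, [1.0, 2.0, 3.0, 4.0]))
```

The weights sum to one and are nonzero only for observations that some tree considers close to x, which is the sense in which the forest behaves like an adaptive nearest-neighbor rule.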

Figure 15.11 demonstrates the similarity between the decision boundary 
of 3-nearest neighbors and random forests on the mixture data. 


4 We gloss over the fact that pure nodes are not split further, and hence there can be 
more than one observation in a terminal node.

[Figure 15.11. Left panel: Random Forest Classifier. Right panel: 3-Nearest Neighbors.]




FIGURE 15.11. Random forests versus 3-NN on the mixture data. The axis-oriented
nature of the individual trees in a random forest leads to decision regions
with an axis-oriented flavor.


Bibliographic Notes 

Random forests as described here were introduced by Breiman (2001), although
many of the ideas had cropped up earlier in the literature in different
forms. Notably Ho (1995) introduced the term “random forest,” and
used a consensus of trees grown in random subspaces of the features. The
idea of using stochastic perturbation and averaging to avoid overfitting was
introduced by Kleinberg (1990), and later in Kleinberg (1996). Amit and
Geman (1997) used randomized trees grown on image features for image
classification problems. Breiman (1996a) introduced bagging, a precursor
to his version of random forests. Dietterich (2000b) also proposed an improvement
on bagging using additional randomization. His approach was
to rank the top 20 candidate splits at each node, and then select from the
list at random. He showed through simulations and real examples that this
additional randomization improved over the performance of bagging. Friedman
and Hall (2007) showed that sub-sampling (without replacement) is
an effective alternative to bagging. They showed that growing and averaging
trees on samples of size N/2 is approximately equivalent (in terms of
bias/variance considerations) to bagging, while using smaller fractions of
N reduces the variance even further (through decorrelation).

There are several free software implementations of random forests. In
this chapter we used the randomForest package in R, maintained by Andy
Liaw, available from the CRAN website. This allows both split-variable selection
and sub-sampling. Adele Cutler maintains a random forest
website http://www.math.usu.edu/~adele/forests/ where (as of August 2008)
the software written by Leo Breiman and Adele Cutler is freely
available. Their code, and the name “random forests”, is exclusively licensed
to Salford Systems for commercial release. The Weka machine learning
archive http://www.cs.waikato.ac.nz/ml/weka/ at Waikato University,
New Zealand, offers a free Java implementation of random forests.


Exercises 


Ex. 15.1 Derive the variance formula (15.1). This appears to fail if $\rho$ is
negative; diagnose the problem in this case.

Ex. 15.2 Show that as the number of bootstrap samples B gets large, the
OOB error estimate for a random forest approaches its N-fold CV error
estimate, and that in the limit, the identity is exact.

Ex. 15.3 Consider the simulation model used in Figure 15.7 (Mease and
Wyner, 2008). Binary observations are generated with probabilities

$$\Pr(Y = 1 \mid X) = q + (1 - 2q) \cdot 1\left[\sum_{j=1}^{J} X_j > J/2\right], \qquad (15.11)$$

where $X \sim U[0,1]^p$, $0 \le q \le \frac{1}{2}$, and $J \le p$ is some predefined (even)
number. Describe this probability surface, and give the Bayes error rate.

Ex. 15.4 Suppose $x_i$, $i = 1, \ldots, N$ are iid $(\mu, \sigma^2)$. Let $\bar{x}_1^*$ and $\bar{x}_2^*$ be two
bootstrap realizations of the sample mean. Show that the sampling correlation
$\mathrm{corr}(\bar{x}_1^*, \bar{x}_2^*) = \frac{N}{2N-1} \approx 50\%$. Along the way, derive $\mathrm{var}(\bar{x}_1^*)$ and
the variance of the bagged mean $\bar{x}_{bag}$. Here $\bar{x}$ is a linear statistic; bagging
produces no reduction in variance for linear statistics.


Ex. 15.5 Show that the sampling correlation between a pair of random-forest
trees at a point x is given by

$$\rho(x) = \frac{\mathrm{Var}_Z[E_{\Theta|Z}\,T(x;\Theta(Z))]}{\mathrm{Var}_Z[E_{\Theta|Z}\,T(x;\Theta(Z))] + E_Z\,\mathrm{Var}_{\Theta|Z}[T(x;\Theta(Z))]}. \qquad (15.12)$$

The term in the numerator is $\mathrm{Var}_Z[\hat{f}_{rf}(x)]$, and the second term in the
denominator is the expected conditional variance due to the randomization
in random forests.


Ex. 15.6 Fit a series of random-forest classifiers to the spam data, to explore 
the sensitivity to the parameter m. Plot both the OOB error as well as the 
test error against a suitably chosen range of values for m. 
Ex. 15.7 Suppose we fit a linear regression model to N observations with
response $y_i$ and predictors $x_{i1}, \ldots, x_{ip}$. Assume that all variables are
standardized to have mean zero and standard deviation one. Let RSS be the
mean-squared residual on the training data, and $\hat{\beta}$ the estimated coefficient.
Denote by RSS* the mean-squared residual on the training data using the
same $\hat{\beta}$, but with the N values for the $j$th variable randomly permuted
before the predictions are calculated. Show that

$$E_P[\mathrm{RSS}^* - \mathrm{RSS}] = 2\hat{\beta}_j^2, \qquad (15.13)$$

where $E_P$ denotes expectation with respect to the permutation distribution.
Argue that this is approximately true when the evaluations are done using
an independent test set.
16

Ensemble Learning 


16.1 Introduction 

The idea of ensemble learning is to build a prediction model by combining 
the strengths of a collection of simpler base models. We have already seen 
a number of examples that fall into this category. 

Bagging in Section 8.7 and random forests in Chapter 15 are ensemble 
methods for classification, where a committee of trees each cast a vote for 
the predicted class. Boosting in Chapter 10 was initially proposed as a 
committee method as well, although unlike random forests, the committee 
of weak learners evolves over time, and the members cast a weighted vote. 
Stacking (Section 8.8) is a novel approach to combining the strengths of 
a number of fitted models. In fact one could characterize any dictionary 
method, such as regression splines, as an ensemble method, with the basis 
functions serving the role of weak learners. 

Bayesian methods for nonparametric regression can also be viewed as
ensemble methods: a large number of candidate models are averaged with
respect to the posterior distribution of their parameter settings (e.g., Neal
and Zhang, 2006).

Ensemble learning can be broken down into two tasks: developing a population
of base learners from the training data, and then combining them
to form the composite predictor. In this chapter we discuss boosting technology
that goes a step further; it builds an ensemble model by conducting
a regularized and supervised search in a high-dimensional space of weak
learners.
An early example of a learning ensemble is a method designed for multiclass
classification using error-correcting output codes (ECOC; Dietterich and Bakiri,
1995). Consider the 10-class digit classification problem, and the
coding matrix C given in Table 16.1.

TABLE 16.1. Part of a 15-bit error-correcting coding matrix C for the 10-class
digit classification problem. Each column defines a two-class classification problem.

Digit   C1   C2   C3   C4   C5   C6   ...   C15
  0      1    1    0    0    0    0   ...    1
  1      0    0    1    1    1    1   ...    0
  2      1    0    0    1    0    0   ...    1
 ...
  8      1    1    0    1    0    1   ...    1
  9      0    1    1    1    0    0   ...    0


Note that the $\ell$th column of the coding matrix $C_\ell$ defines a two-class
variable that merges all the original classes into two groups. The method
works as follows:

1. Learn a separate classifier for each of the L = 15 two-class problems
defined by the columns of the coding matrix.

2. At a test point x, let $\hat{p}_\ell(x)$ be the predicted probability of a one for
the $\ell$th response.

3. Define $\delta_k(x) = \sum_{\ell=1}^{L} |C_{k\ell} - \hat{p}_\ell(x)|$, the discriminant function for the
$k$th class, where $C_{k\ell}$ is the entry for row k and column $\ell$ in Table 16.1.

Each row of C is a binary code for representing that class. The rows have
more bits than is necessary, and the idea is that the redundant
“error-correcting” bits allow for some inaccuracies, and can improve performance.
In fact, the full code matrix C above has a minimum Hamming distance 1
of 7 between any pair of rows. Note that even the indicator response coding
(Section 4.2) is redundant, since 10 classes require only $\lceil \log_2 10 \rceil = 4$ bits for
their unique representation. Dietterich and Bakiri (1995) showed impressive
improvements in performance for a variety of multiclass problems when
classification trees were used as the base classifier.
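
The decoding step can be sketched as follows: each class is scored by the summed absolute difference between its code word and the predicted bit probabilities, and the class with the smallest score is chosen. The three-class, five-bit code matrix below is a made-up example, not the matrix of Table 16.1, and the function names are hypothetical.

```python
def hamming(u, v):
    """Number of positions at which two code words differ."""
    return sum(a != b for a, b in zip(u, v))


def ecoc_decode(code_matrix, probs):
    """Score each class k by sum over bits of |C_k,l - p_l(x)| and
    return the class with the smallest score, plus all scores."""
    scores = {
        k: sum(abs(c - p) for c, p in zip(row, probs))
        for k, row in code_matrix.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    code = {
        "a": [1, 1, 0, 0, 0],
        "b": [0, 0, 1, 1, 0],
        "c": [1, 0, 1, 0, 1],
    }
    probs = [0.9, 0.8, 0.2, 0.1, 0.3]  # predicted P(bit = 1) from 5 classifiers
    label, scores = ecoc_decode(code, probs)
    print(label, scores)
    print(hamming(code["a"], code["b"]))
```

With hard 0/1 predictions in place of probabilities, the score reduces to the Hamming distance between the predicted bit string and each code word.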

James and Hastie (1998) analyzed the ECOC approach, and showed 
that random code assignment worked as well as the optimally constructed 
error-correcting codes. They also argued that the main benefit of the coding 
was in variance reduction (as in bagging and random forests), because the 
different coded problems resulted in different trees, and the decoding step 
(3) above has a similar effect as averaging. 


1 The Hamming distance between two vectors is the number of mismatches between
corresponding entries.
16.2 Boosting and Regularization Paths

In Section 10.12.2 of the first edition of this book, we suggested an analogy 
between the sequence of models produced by a gradient boosting algorithm 
and regularized model fitting in high-dimensional feature spaces. This was 
primarily motivated by observing the close connection between a boosted 
version of linear regression and the lasso (Section 3.4.2). These connections
have been pursued by us and others, and here we present our current
thinking in this area. We start with the original motivation, which fits more 
naturally in this chapter on ensemble learning. 


16.2.1 Penalized Regression 

Intuition for the success of the shrinkage strategy (10.41) of gradient boosting
(page 364 in Chapter 10) can be obtained by drawing analogies with
penalized linear regression with a large basis expansion. Consider the dictionary
of all possible $J$-terminal node regression trees $\mathcal{T} = \{T_k\}$ that could
be realized on the training data as basis functions in $\mathbb{R}^p$. The linear model
is

$$f(x) = \sum_{k=1}^{K} \alpha_k T_k(x), \tag{16.1}$$

where $K = \mathrm{card}(\mathcal{T})$. Suppose the coefficients are to be estimated by least
squares. Since the number of such trees is likely to be much larger than
even the largest training data sets, some form of regularization is required.
Let $\hat{\alpha}(\lambda)$ solve

$$\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}. \tag{16.2}$$


$J(\alpha)$ is a function of the coefficients that generally penalizes larger values.
Examples are

$$J(\alpha) = \sum_{k=1}^{K} |\alpha_k|^2 \quad \text{ridge regression,} \tag{16.3}$$

$$J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{lasso,} \tag{16.4}$$


both covered in Section 3.4. As discussed there, the solution to the lasso
problem with moderate to large $\lambda$ tends to be sparse; many of the $\hat{\alpha}_k(\lambda) = 0$.
That is, only a small fraction of all possible trees enter the model (16.1).

Algorithm 16.1 Forward Stagewise Linear Regression.

1. Initialize $\check{\alpha}_k = 0$, $k = 1, \ldots, K$. Set $\varepsilon > 0$ to some small constant,
and $M$ large.

2. For $m = 1$ to $M$:

(a) $(\beta^*, k^*) = \arg\min_{\beta, k} \sum_{i=1}^{N} \big( y_i - \sum_{l=1}^{K} \check{\alpha}_l T_l(x_i) - \beta T_k(x_i) \big)^2$.

(b) $\check{\alpha}_{k^*} \leftarrow \check{\alpha}_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$.

3. Output $f_M(x) = \sum_{k=1}^{K} \check{\alpha}_k T_k(x)$.


This seems reasonable since it is likely that only a small fraction of all possible
trees will be relevant in approximating any particular target function.
However, the relevant subset will be different for different targets. Those
coefficients that are not set to zero are shrunk by the lasso in that their
absolute values are smaller than their corresponding least squares values$^2$:
$|\hat{\alpha}_k(\lambda)| < |\hat{\alpha}_k(0)|$. As $\lambda$ increases, the coefficients all shrink, each one
ultimately becoming zero.

Owing to the very large number of basis functions $T_k$, directly solving
(16.2) with the lasso penalty (16.4) is not possible. However, a feasible
forward stagewise strategy exists that closely approximates the effect of
the lasso, and is very similar to boosting and the forward stagewise
Algorithm 10.2. Algorithm 16.1 gives the details. Although phrased in terms
of tree basis functions $T_k$, the algorithm can be used with any set of basis
functions. Initially all coefficients are zero in line 1; this corresponds
to $\lambda = \infty$ in (16.2). At each successive step, the tree $T_{k^*}$ is selected that
best fits the current residuals in line 2(a). Its corresponding coefficient $\check{\alpha}_{k^*}$
is then incremented or decremented by an infinitesimal amount in 2(b),
while all other coefficients $\check{\alpha}_k$, $k \neq k^*$ are left unchanged. In principle, this
process could be iterated until either all the residuals are zero, or $\beta^* = 0$.
The latter case can occur if $K < N$, and at that point the coefficient values
represent a least squares solution. This corresponds to $\lambda = 0$ in (16.2).

After applying Algorithm 16.1 with $M < \infty$ iterations, many of the coefficients
will be zero, namely, those that have yet to be incremented. The others
will tend to have absolute values smaller than their corresponding least
squares solution values, $|\check{\alpha}_k(M)| < |\hat{\alpha}_k(0)|$. Therefore this $M$-iteration
solution qualitatively resembles the lasso, with $M$ inversely related to $\lambda$.
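A minimal Python sketch of Algorithm 16.1 may help fix ideas. It assumes the basis functions have already been evaluated on the training data, so that `X[i][k]` holds $T_k(x_i)$; the function name and the toy data are ours. For a fixed $k$, the minimizing $\beta$ in step 2(a) is the univariate least squares coefficient of the current residual on $T_k$, so the search reduces to comparing the residual sum-of-squares reductions across $k$.

```python
def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise linear regression on precomputed basis values.

    X is a list of N rows, each holding the K basis evaluations T_k(x_i);
    y is the list of N responses. Returns the K coefficients.
    """
    n, k = len(X), len(X[0])
    alpha = [0.0] * k                      # step 1: all coefficients start at zero
    resid = list(y)                        # current residuals y_i - f(x_i)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(k)]

    for _ in range(n_steps):               # step 2
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(k):
            if col_ss[j] == 0.0:
                continue
            ip = sum(X[i][j] * resid[i] for i in range(n))
            beta = ip / col_ss[j]          # univariate least squares coefficient
            gain = ip * ip / col_ss[j]     # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        if best_j is None:                 # beta* = 0: nothing left to fit
            break
        step = eps if best_beta > 0 else -eps      # step 2(b)
        alpha[best_j] += step
        for i in range(n):
            resid[i] -= step * X[i][best_j]
    return alpha


# Toy example with two orthogonal basis columns; only the first carries signal.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [0.5, 0.5, -0.5, -0.5]
alpha = forward_stagewise(X, y, eps=0.01, n_steps=30)
```

After 30 steps of size 0.01 the first coefficient has climbed to about 0.3, still short of its least squares value of 0.5, while the second has never been touched: the shrunken, sparse behaviour described above.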

Figure 16.1 shows an example, using the prostate data studied in Chapter 3.
Here, instead of using trees $T_k(X)$ as basis functions, we use the original
variables $X_k$ themselves; that is, a multiple linear regression model. The
left panel displays the profiles of estimated coefficients from the lasso, for
different values of the bound parameter $t = \sum_k |\alpha_k|$. The right panel shows
the results of the stagewise Algorithm 16.1, with $M = 250$ and $\varepsilon = 0.01$.
[The left and right panels of Figure 16.1 are the same as Figure 3.10 and
the left panel of Figure 3.19, respectively.] The similarity between the two
graphs is striking.

$^2$ If $K > N$, there is in general no unique “least squares value,” since infinitely many
solutions will exist that fit the data perfectly. We can pick the minimum $L_1$-norm solution
amongst these, which is the unique lasso solution.

[Figure 16.1 panels: Lasso (left), Forward Stagewise (right).]

FIGURE 16.1. Profiles of estimated coefficients from linear regression, for the
prostate data studied in Chapter 3. The left panel shows the results from the lasso,
for different values of the bound parameter $t = \sum_k |\alpha_k|$. The right panel shows
the results of the stagewise linear regression Algorithm 16.1, using $M = 220$
consecutive steps of size $\varepsilon = .01$.

In some situations the resemblance is more than qualitative. For example,
if all of the basis functions $T_k$ are mutually uncorrelated, then as $\varepsilon \downarrow 0$, $M \uparrow$
such that $M\varepsilon \to t$, Algorithm 16.1 yields exactly the same solution as the
lasso for bound parameter $t = \sum_k |\alpha_k|$ (and likewise for all solutions along
the path). Of course, tree-based regressors are not uncorrelated. However,
the solution sets are also identical if the coefficients $\hat{\alpha}_k(\lambda)$ are all monotone
functions of $\lambda$. This is often the case when the correlation between the
variables is low. When the $\hat{\alpha}_k(\lambda)$ are not monotone in $\lambda$, then the solution
sets are not identical. The solution sets for Algorithm 16.1 tend to change
less rapidly with changing values of the regularization parameter than those
of the lasso.

Efron et al. (2004) make the connections more precise, by characterizing
the exact solution paths in the $\varepsilon$-limiting case. They show that the coefficient
paths are piece-wise linear functions, both for the lasso and forward
stagewise. This facilitates efficient algorithms which allow the entire paths
to be computed with the same cost as a single least-squares fit. This least
angle regression algorithm is described in more detail in Section 3.8.1.

Hastie et al. (2007) show that this infinitesimal forward stagewise algorithm
(FS$_0$) fits a monotone version of the lasso, which optimally reduces
at each step the loss function for a given increase in the arc length of the
coefficient path (see Sections 16.2.3 and 3.8.1). The arc-length for the $\varepsilon > 0$
case is $M\varepsilon$, and hence proportional to the number of steps.
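The distinction between the $L_1$ norm of a coefficient vector and the $L_1$ arc-length of the path leading to it can be made concrete with a short sketch. The helper names and the small example paths below are ours and purely illustrative.

```python
def l1_norm(alpha):
    """L1 norm of a single coefficient vector."""
    return sum(abs(a) for a in alpha)


def l1_arc_length(path):
    """Total L1 distance travelled along a sequence of coefficient vectors."""
    return sum(
        sum(abs(b - a) for a, b in zip(prev, cur))
        for prev, cur in zip(path, path[1:])
    )


# A non-monotone path: the first coefficient rises to 0.5 and then falls back.
path = [[0.0, 0.0], [0.5, 0.0], [0.5, 0.5], [0.25, 0.5]]
arc = l1_arc_length(path)      # 0.5 + 0.5 + 0.25 = 1.25
norm = l1_norm(path[-1])       # 0.25 + 0.5 = 0.75

# A stagewise-style path: each of M = 30 steps moves one coefficient by eps.
eps, M = 0.01, 30
stagewise_path = [[eps * t] for t in range(M + 1)]
stagewise_arc = l1_arc_length(stagewise_path)   # approximately M * eps
```

The two measures agree only when every coefficient moves monotonically. Backtracking, as in the first path, adds to the arc-length without adding to the norm.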

Tree boosting (Algorithm 10.3) with shrinkage (10.41) closely resembles
Algorithm 16.1, with the learning rate parameter $\nu$ corresponding to $\varepsilon$. For
squared error loss, the only difference is that the optimal tree to be selected
at each iteration $T_{k^*}$ is approximated by the standard top-down greedy
tree-induction algorithm. For other loss functions, such as the exponential
loss of AdaBoost and the binomial deviance, Rosset et al. (2004a) show
similar results to what we see here. Thus, one can view tree boosting with
shrinkage as a form of monotone ill-posed regression on all possible
($J$-terminal node) trees, with the lasso penalty (16.4) as a regularizer. We
return to this topic in Section 16.2.3.

The choice of no shrinkage [$\nu = 1$ in equation (10.41)] is analogous to
forward-stepwise regression, and its more aggressive cousin best-subset
selection, which penalizes the number of nonzero coefficients $J(\alpha) = \sum_k |\alpha_k|^0$.

With a small fraction of dominant variables, best subset approaches often 
work well. But with a moderate fraction of strong variables, it is well known 
that subset selection can be excessively greedy (Copas, 1983), often yielding 
poor results when compared to less aggressive strategies such as the lasso 
or ridge regression. The dramatic improvements often seen when shrinkage 
is used with boosting are yet another confirmation of this approach. 

16.2.2 The “Bet on Sparsity” Principle 

As shown in the previous section, boosting’s forward stagewise strategy
with shrinkage approximately minimizes the same loss function with a
lasso-style $L_1$ penalty. The model is built up slowly, searching through
“model space” and adding shrunken basis functions derived from important
predictors. In contrast, the $L_2$ penalty is computationally much easier
to deal with, as shown in Section 12.3.7. With the basis functions and $L_2$
penalty chosen to match a particular positive-definite kernel, one can solve
the corresponding optimization problem without explicitly searching over
individual basis functions.

However, the sometimes superior performance of boosting over procedures
such as the support vector machine may be largely due to the implicit
use of the $L_1$ versus $L_2$ penalty. The shrinkage resulting from the
$L_1$ penalty is better suited to sparse situations, where there are few basis
functions with nonzero coefficients (among all possible choices).

We can strengthen this argument through a simple example, taken from
Friedman et al. (2004). Suppose we have 10,000 data points and our model
is a linear combination of a million trees. If the true population coefficients
of these trees arose from a Gaussian distribution, then we know that in a
Bayesian sense the best predictor is ridge regression (Exercise 3.6). That is,
we should use an $L_2$ rather than an $L_1$ penalty when fitting the coefficients.
On the other hand, if there are only a small number (e.g., 1000) coefficients
that are nonzero, the lasso ($L_1$ penalty) will work better. We think of this
as a sparse scenario, while the first case (Gaussian coefficients) is dense.
Note however that in the dense scenario, although the $L_2$ penalty is best,
neither method does very well since there is too little data from which to
estimate such a large number of nonzero coefficients. This is the curse of
dimensionality taking its toll. In a sparse setting, we can potentially do
well with the $L_1$ penalty, since the number of nonzero coefficients is small.
The $L_2$ penalty fails again.

In other words, use of the $L_1$ penalty follows what we call the “bet on
sparsity” principle for high-dimensional problems:

Use a procedure that does well in sparse problems, since no procedure
does well in dense problems.

These comments need some qualification: 

• For any given application, the degree of sparseness/denseness depends
on the unknown true target function, and the chosen dictionary $\mathcal{T}$.

• The notion of sparse versus dense is relative to the size of the training
data set and/or the noise-to-signal ratio (NSR). Larger training
sets allow us to estimate coefficients with smaller standard errors.
Likewise in situations with small NSR, we can identify more nonzero 
coefficients with a given sample size than in situations where the NSR 
is larger. 

• The size of the dictionary plays a role as well. Increasing the size of the 
dictionary may lead to a sparser representation for our function, but 
the search problem becomes more difficult leading to higher variance. 

[Figure 16.2: boxplot panels in two columns, Regression and Classification, each with rows Lasso/Gaussian and Ridge/Gaussian, Lasso/Subset 10 and Ridge/Subset 10, Lasso/Subset 30 and Ridge/Subset 30. Horizontal axis: Noise-to-Signal Ratio (0.1 to 0.5). Vertical axis: Percentage Squared Prediction Error Explained.]

FIGURE 16.2. Simulations that show the superiority of the $L_1$ (lasso) penalty
over $L_2$ (ridge) in regression and classification. Each run has 50 observations
with 300 independent Gaussian predictors. In the top row all 300 coefficients are
nonzero, generated from a Gaussian distribution. In the middle row, only 10 are
nonzero, and the last row has 30 nonzero. Gaussian errors are added to the linear
predictor $\eta(X)$ for the regression problems, and binary responses generated via the
inverse-logit transform for the classification problems. Scaling of $\eta(X)$ resulted in
the noise-to-signal ratios shown. Lasso is used in the left sub-columns, ridge in the
right. We report the optimal percentage of error explained on test data (relative
to the error of a constant model), displayed as boxplots over 20 realizations for
each combination. In the only situation where ridge beats lasso (top row), neither
do well.

Figure 16.2 illustrates these points in the context of linear models using
simulation. We compare ridge regression and lasso, both for classification
and regression problems. Each run has 50 observations with 300
independent Gaussian predictors. In the top row all 300 coefficients are
nonzero, generated from a Gaussian distribution. In the middle row, only
10 are nonzero and generated from a Gaussian, and the last row has 30
nonzero Gaussian coefficients. For regression, standard Gaussian noise is
added to the linear predictor $\eta(X) = X^T\beta$ to produce a continuous response.
For classification the linear predictor is transformed via the inverse-logit
to a probability, and a binary response is generated. Five different
noise-to-signal ratios are presented, obtained by scaling $\eta(X)$ prior
to generating the response. In both cases this is defined to be
$\mathrm{NSR} = \mathrm{Var}(Y|\eta(X))/\mathrm{Var}(\eta(X))$. Both the ridge regression and lasso coefficient
paths were fit using a series of 50 values of $\lambda$ corresponding to a range of
df from 1 to 50 (see Chapter 3 for details). The models were evaluated on
a large test set (infinite for Gaussian, 5000 for binary), and in each case the
value for $\lambda$ was chosen to minimize the test-set error. We report percentage
variance explained for the regression problems, and percentage misclassification
error explained for the classification problems (relative to a baseline
error of 0.5). There are 20 simulation runs for each scenario.
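For the regression case, where $\mathrm{Var}(Y|\eta(X))$ is simply the noise variance, the scaling that produces a target noise-to-signal ratio can be sketched as follows. The function name `scale_for_nsr` and the toy values are ours; this is an illustration of the definition, not the code used for Figure 16.2, and it does not cover the classification case, where the conditional variance depends on the fitted probability.

```python
from statistics import pvariance


def scale_for_nsr(eta, noise_var, target_nsr):
    """Return c such that noise_var / Var(c * eta) equals target_nsr.

    eta holds the linear-predictor values; Var is the population variance.
    """
    return (noise_var / (target_nsr * pvariance(eta))) ** 0.5


# Toy linear-predictor values with variance 1.25 and unit noise variance.
eta = [1.0, 2.0, 3.0, 4.0]
c = scale_for_nsr(eta, noise_var=1.0, target_nsr=0.2)
scaled = [c * e for e in eta]
achieved_nsr = 1.0 / pvariance(scaled)
```

Since $\mathrm{Var}(c\,\eta) = c^2 \mathrm{Var}(\eta)$, solving $\sigma^2 / (c^2 \mathrm{Var}(\eta)) = \mathrm{NSR}$ for $c$ gives the square-root expression used above.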

Note that for the classification problems, we are using squared-error loss
to fit the binary response. Note also that we are not using the training
data to select $\lambda$, but rather are reporting the best possible behavior for
each method in the different scenarios. The $L_2$ penalty performs poorly
everywhere. The lasso performs reasonably well in the only two situations
where it can (sparse coefficients). As expected the performance gets worse
as the NSR increases (less so for classification), and as the model becomes
denser. The differences are less marked for classification than for regression.

These empirical results are backed by a large body of theoretical
results (Donoho and Johnstone, 1994; Donoho and Elad, 2003; Donoho,
2006b; Candes and Tao, 2007) establishing the superiority of $L_1$ estimation
in sparse settings.

16.2.3 Regularization Paths, Over-fitting and Margins 

It has often been observed that boosting “does not overfit,” or more astutely
is “slow to overfit.” Part of the explanation for this phenomenon was
made earlier for random forests: misclassification error is less sensitive to
variance than is mean-squared error, and classification is the major focus
in the boosting community. In this section we show that the regularization
paths of boosted models are “well behaved,” and that for certain loss
functions they have an appealing limiting form.

Figure 16.3 shows the coefficient paths for lasso and infinitesimal forward
stagewise (FS$_0$) in a simulated regression setting. The data consists of a
dictionary of 1000 Gaussian variables, strongly correlated ($\rho = 0.95$) within
blocks of 20, but uncorrelated between blocks. The generating model has
nonzero coefficients for 50 variables, one drawn from each block, and the
coefficient values are drawn from a standard Gaussian. Finally, Gaussian
noise is added, with a noise-to-signal ratio of 0.72 (Exercise 16.1). The
FS$_0$ algorithm is a limiting form of Algorithm 16.1, where the step size $\varepsilon$
is shrunk to zero (Section 3.8.1). The grouping of the variables is intended
to mimic the correlations of nearby trees, and with the forward-stagewise
algorithm, this setup is intended as an idealized version of gradient boosting
with shrinkage. For both these algorithms, the coefficient paths can be
computed exactly, since they are piecewise linear (see the LARS algorithm
in Section 3.8.1).

[Figure 16.3 panels: LASSO (left), Forward Stagewise (right). Horizontal axis: $|\alpha(m)|/|\alpha(\infty)|$.]

FIGURE 16.3. Comparison of lasso and infinitesimal forward stagewise paths
on simulated regression data. The number of samples is 60 and the number of
variables is 1000. The forward-stagewise paths fluctuate less than those of lasso
in the final stages of the algorithms.

Here the coefficient profiles are similar only in the early stages of the
paths. For the later stages, the forward stagewise paths tend to be monotone
and smoother, while those for the lasso fluctuate widely. This is due
to the strong correlations among subsets of the variables; the lasso suffers
somewhat from the multi-collinearity problem (Exercise 3.28).

The performance of the two models is rather similar (Figure 16.4), and 
they achieve about the same minimum. In the later stages forward stagewise 
takes longer to overfit, a likely consequence of the smoother paths. 

Hastie et al. (2007) show that FS$_0$ solves a monotone version of the lasso
problem for squared error loss. Let $\mathcal{T}^a = \mathcal{T} \cup \{-\mathcal{T}\}$ be the augmented
dictionary obtained by including a negative copy of every basis element
in $\mathcal{T}$. We consider models $f(x) = \sum_{T_k \in \mathcal{T}^a} \alpha_k T_k(x)$ with non-negative
coefficients $\alpha_k \geq 0$. In this expanded space, the lasso coefficient paths are
positive, while those of FS$_0$ are monotone nondecreasing.
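The passage from signed coefficients on $\mathcal{T}$ to non-negative coefficients on the augmented dictionary is a simple bookkeeping step, sketched below. The helper names and numbers are ours and purely illustrative.

```python
def to_augmented(alpha):
    """Map signed coefficients on T to non-negative ones on T followed by -T."""
    positive = [max(a, 0.0) for a in alpha]    # weights on the original T_k
    negative = [max(-a, 0.0) for a in alpha]   # weights on the copies -T_k
    return positive + negative


def augmented_basis(row):
    """Basis values for the augmented dictionary: T_k(x), then -T_k(x)."""
    return list(row) + [-t for t in row]


def predict(coefs, basis_row):
    """Linear combination of basis values."""
    return sum(c * t for c, t in zip(coefs, basis_row))


alpha = [1.5, -2.0, 0.0]          # signed coefficients on three basis functions
row = [1.0, 0.5, -3.0]            # T_1(x), T_2(x), T_3(x) at some point x
aug_alpha = to_augmented(alpha)   # all entries non-negative
same_fit = predict(alpha, row) == predict(aug_alpha, augmented_basis(row))
```

The fitted function is unchanged; only the representation differs, which is why requiring non-negative coefficients in the augmented space loses no generality.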

The monotone lasso path is characterized by a differential equation

$$\frac{\partial \alpha}{\partial \ell} = \rho^{ml}(\alpha(\ell)), \tag{16.6}$$

with initial condition $\alpha(0) = 0$, where $\ell$ is the $L_1$ arc-length of the path
$\alpha(\ell)$ (Exercise 16.2). The monotone lasso move direction (velocity vector)
$\rho^{ml}(\alpha(\ell))$ decreases the loss at the optimal quadratic rate per unit increase
in the $L_1$ arc-length of the path. Since $\rho_k^{ml}(\alpha(\ell)) \geq 0 \;\; \forall k, \ell$, the solution
paths are monotone.

FIGURE 16.4. Mean squared error for lasso and infinitesimal forward stagewise
on the simulated data. Despite the difference in the coefficient paths, the two
models perform similarly over the critical part of the regularization path. In the
right tail, lasso appears to overfit more rapidly.

The lasso can similarly be characterized as the solution to a differential
equation as in (16.6), except that the move directions decrease the loss
optimally per unit increase in the $L_1$ norm of the path. As a consequence,
they are not necessarily positive, and hence the lasso paths need not be 
monotone. 

In this augmented dictionary, restricting the coefficients to be positive is 
natural, since it avoids an obvious ambiguity. It also ties in more naturally 
with tree boosting—we always find trees positively correlated with the 
current residual. 

There have been suggestions that boosting performs well (for two-class
classification) because it exhibits maximal-margin properties, much like the
support-vector machines of Section 4.5.2 and Chapter 12. Schapire et al. (1998)
define the normalized $L_1$ margin of a fitted model $f(x) = \sum_k \alpha_k T_k(x)$ as

$$m(f) = \min_i \frac{y_i f(x_i)}{\sum_{k=1}^{K} |\alpha_k|}. \tag{16.7}$$

Here the minimum is taken over the training sample, and $y_i \in \{-1, +1\}$.
Unlike the $L_2$ margin (4.40) of support vector machines, the $L_1$ margin
$m(f)$ measures the distance to the closest training point in $L_\infty$ units
(maximum coordinate distance).
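As a small illustration, the normalized $L_1$ margin of Schapire et al. (1998), $m(f) = \min_i y_i f(x_i) / \sum_k |\alpha_k|$, can be computed directly from the coefficients and the basis values on the training sample. The function name and the toy numbers below are ours.

```python
def l1_margin(alpha, basis_values, y):
    """Normalized L1 margin: min_i y_i f(x_i) / sum_k |alpha_k|.

    basis_values[i][k] holds T_k(x_i); y[i] is -1 or +1.
    """
    norm = sum(abs(a) for a in alpha)
    if norm == 0.0:
        raise ValueError("margin is undefined when all coefficients are zero")
    fits = [sum(a * t for a, t in zip(alpha, row)) for row in basis_values]
    return min(yi * fi for yi, fi in zip(y, fits)) / norm


alpha = [2.0, -1.0]
basis_values = [[1, -1], [1, 1], [-1, 1]]   # classifiers taking values in {-1, +1}
y = [1, 1, -1]
margin = l1_margin(alpha, basis_values, y)            # (min of 3, 1, 3) / 3
rescaled = l1_margin([20.0, -10.0], basis_values, y)  # same value after rescaling
```

Because of the normalization, rescaling all coefficients leaves the margin unchanged, and a positive margin means every training point is classified correctly.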
FIGURE 16.5. The left panel shows the $L_1$ margin $m(f)$ for the Adaboost classifier
on the mixture data, as a function of the number of 4-node trees. The model
was fit using the R package gbm, with a shrinkage factor of 0.02. After 10,000
trees, $m(f)$ has settled down. Note that when the margin crosses zero, the training
error becomes zero. The right panel shows the test error, which is minimized at
240 trees. In this case, Adaboost overfits dramatically if run to convergence.

Schapire et al. (1998) prove that with separable data, Adaboost increases
$m(f)$ with each iteration, converging to a margin-symmetric solution.
Ratsch and Warmuth (2002) prove the asymptotic convergence of
Adaboost with shrinkage to a $L_1$-margin-maximizing solution. Rosset et
al. (2004a) consider regularized models of the form (16.2) for general loss
functions. They show that as $\lambda \downarrow 0$, for particular loss functions the solution
converges to a margin-maximizing configuration. In particular they show
this to be the case for the exponential loss of Adaboost, as well as binomial
deviance.

Collecting together the results of this section, we reach the following 
summary for boosted classifiers: 

The sequence of boosted classifiers forms an $L_1$-regularized monotone
path to a margin-maximizing solution.

Of course the margin-maximizing end of the path can be a very poor, overfit 
solution, as it is in the example in Figure 16.5. Early stopping amounts 
to picking a point along the path, and should be done with the aid of a 
validation dataset. 


16.3 Learning Ensembles 

The insights learned from the previous sections can be harnessed to produce
a more effective and efficient ensemble model. Again we consider functions
of the form

$$f(x) = \alpha_0 + \sum_{T_k \in \mathcal{T}} \alpha_k T_k(x), \tag{16.8}$$

where $\mathcal{T}$ is a dictionary of basis functions, typically trees. For gradient
boosting and random forests, $|\mathcal{T}|$ is very large, and it is quite typical for the
final model to involve many thousands of trees. In the previous section we
argued that gradient boosting with shrinkage fits an $L_1$ regularized monotone
path in this space of trees.

Friedman and Popescu (2003) propose a hybrid approach which breaks
this process down into two stages:

• A finite dictionary $\mathcal{T}_L = \{T_1(x), T_2(x), \ldots, T_M(x)\}$ of basis functions
is induced from the training data;

• A family of functions $f_\lambda(x)$ is built by fitting a lasso path in this
dictionary:

$$\alpha(\lambda) = \arg\min_{\alpha} \sum_{i=1}^{N} L\Big[ y_i, \alpha_0 + \sum_{m=1}^{M} \alpha_m T_m(x_i) \Big] + \lambda \sum_{m=1}^{M} |\alpha_m|. \tag{16.9}$$

In its simplest form this model could be seen as a way of post-processing
boosting or random forests, taking for $\mathcal{T}_L$ the collection of trees produced
by the gradient boosting or random forest algorithms. By fitting the lasso
path to these trees, we would typically use a much reduced set, which would
save in computations and storage for future predictions. In the next section
we describe modifications of this prescription that reduce the correlations in
the ensemble $\mathcal{T}_L$, and improve the performance of the lasso post-processor.
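For squared-error loss, the post-processing step (16.9) at a single value of $\lambda$ can be sketched with cyclic coordinate descent. This is an illustrative implementation under our own assumptions, not the software used for the figures: `T[i][m]` is taken to hold the prediction $T_m(x_i)$ of the $m$th tree at the $i$th training point, the intercept is left unpenalized, and the names are ours. With the loss written as a plain sum of squares, each coordinate update soft-thresholds at $\lambda/2$.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_postprocess(T, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - a0 - sum_m a_m T[i][m])^2 + lam * sum_m |a_m|."""
    n, m = len(T), len(T[0])
    alpha0 = sum(y) / n
    alpha = [0.0] * m
    resid = [yi - alpha0 for yi in y]
    ss = [sum(T[i][j] ** 2 for i in range(n)) for j in range(m)]

    for _ in range(n_sweeps):
        shift = sum(resid) / n              # unpenalized intercept update
        alpha0 += shift
        resid = [r - shift for r in resid]
        for j in range(m):
            if ss[j] == 0.0:
                continue
            # inner product of tree j with the residual that excludes tree j
            rho = sum(T[i][j] * resid[i] for i in range(n)) + alpha[j] * ss[j]
            new = soft_threshold(rho, lam / 2.0) / ss[j]
            delta = new - alpha[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * T[i][j]
                alpha[j] = new
    return alpha0, alpha


# Two hypothetical "trees"; only the first is related to the response.
T = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.0, 3.0, -1.0, -1.0]
alpha0, alpha = lasso_postprocess(T, y, lam=4.0)
kept = [j for j, a in enumerate(alpha) if a != 0.0]   # trees retained
```

Trees whose coefficients are exactly zero can be dropped from the stored ensemble, which is the source of the savings in computation and storage mentioned above.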

As an initial illustration, we apply this procedure to a random forest 
ensemble grown on the spam data. 

Figure 16.6 shows that lasso post-processing offers modest improvement
over the random forest (blue curve), and reduces the forest to about
40 trees, rather than the original 1000. The post-processed performance
matches that of gradient boosting. The orange curves represent a modified
version of random forests, designed to reduce the correlations between trees
even more. Here a random sub-sample (without replacement) of 5% of the
training sample is used to grow each tree, and the trees are restricted to be
shallow (about six terminal nodes). The post-processing offers more dramatic
improvements here, and the training costs are reduced by a factor
of about 100. However, the performance of the post-processed model falls
somewhat short of the blue curves.

16.3.1 Learning a Good Ensemble 

Not all ensembles $\mathcal{T}_L$ will perform well with post-processing. In terms of
basis functions, we want a collection that covers the space well in places
where they are needed, and are sufficiently different from each other for
the post-processor to be effective.

[Figure 16.6: Spam Data.]

FIGURE 16.6. Application of the lasso post-processing (16.9) to the spam data.
The horizontal blue line is the test error of a random forest fit to the spam data,
using 1000 trees grown to maximum depth (with $m = 7$; see Algorithm 15.1).
The jagged blue curve is the test error after post-processing the first 500 trees
using the lasso, as a function of the number of trees with nonzero coefficients.
The orange curve/line use a modified form of random forest, where a random
draw of 5% of the data are used to grow each tree, and the trees are forced to
be shallow (typically six terminal nodes). Here the post-processing offers much
greater improvement over the random forest that generated the ensemble.

Friedman and Popescu (2003) gain insights from numerical quadrature
and importance sampling. They view the unknown function as an integral

$$f(x) = \int \beta(\gamma)\, b(x; \gamma)\, d\gamma, \tag{16.10}$$

where $\gamma \in \Gamma$ indexes the basis functions $b(x; \gamma)$. For example, if the basis
functions are trees, then $\gamma$ indexes the splitting variables, the split-points
and the values in the terminal nodes. Numerical quadrature amounts to
finding a set of $M$ evaluation points $\gamma_m \in \Gamma$ and corresponding weights
$\alpha_m$ so that $f_M(x) = \alpha_0 + \sum_{m=1}^{M} \alpha_m b(x; \gamma_m)$ approximates $f(x)$ well over
the domain of $x$. Importance sampling amounts to sampling $\gamma$ at random,
but giving more weight to relevant regions of the space $\Gamma$. Friedman and
Popescu (2003) suggest a measure of (lack of) relevance that uses the loss
function (16.9):

$$Q(\gamma) = \min_{c_0, c_1} \sum_{i=1}^{N} L(y_i, c_0 + c_1 b(x_i; \gamma)), \tag{16.11}$$

evaluated on the training data.

If a single basis function were to be selected (e.g., a tree), it would be
the global minimizer $\gamma^* = \arg\min_{\gamma \in \Gamma} Q(\gamma)$. Introducing randomness in the
selection of $\gamma$ would necessarily produce less optimal values with
$Q(\gamma) \geq Q(\gamma^*)$. They propose a natural measure of the characteristic width $\sigma$ of the
sampling scheme $\mathcal{S}$,

$$\sigma = E_{\mathcal{S}}[Q(\gamma) - Q(\gamma^*)]. \tag{16.12}$$

• $\sigma$ too narrow suggests too many of the $b(x; \gamma_m)$ look alike, and similar
to $b(x; \gamma^*)$;

• $\sigma$ too wide implies a large spread in the $b(x; \gamma_m)$, but possibly consisting
of many irrelevant cases.

Friedman and Popescu (2003) use sub-sampling as a mechanism for introducing
randomness, leading to their ensemble-generation Algorithm 16.2.


Algorithm 16.2 ISLE Ensemble Generation. 

1. $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$

2. For $m = 1$ to $M$ do

(a) $\gamma_m = \arg\min_{\gamma} \sum_{i \in S_m(\eta)} L(y_i, f_{m-1}(x_i) + b(x_i; \gamma))$

(b) $f_m(x) = f_{m-1}(x) + \nu\, b(x; \gamma_m)$

3. $\mathcal{T}_{ISLE} = \{b(x; \gamma_1), b(x; \gamma_2), \ldots, b(x; \gamma_M)\}$.


$S_m(\eta)$ refers to a subsample of $N \cdot \eta$ ($\eta \in (0, 1]$) of the training observations,
typically without replacement. Their simulations suggest picking
$\eta \leq 1/2$, and for large $N$ picking $\eta \sim 1/\sqrt{N}$. Reducing $\eta$ increases the
randomness, and hence the width $\sigma$. The parameter $\nu \in [0, 1]$ introduces
memory into the randomization process; the larger $\nu$, the more the procedure
avoids $b(x; \gamma)$ similar to those found before. A number of familiar
randomization schemes are special cases of Algorithm 16.2:

Bagging has $\eta = 1$, but samples with replacement, and has $\nu = 0$. Friedman
and Hall (2007) argue that sampling without replacement with
$\eta = 1/2$ is equivalent to sampling with replacement with $\eta = 1$, and
the former is much more efficient.

Random forest sampling is similar, with more randomness introduced by
the selection of the splitting variable. Reducing $\eta < 1/2$ in
Algorithm 16.2 has a similar effect to reducing $m$ in random forests, but
does not suffer from the potential biases discussed in Section 15.4.2.

Gradient boosting with shrinkage (10.41) uses $\eta = 1$, but typically does
not produce sufficient width $\sigma$.

Stochastic gradient boosting (Friedman, 1999) follows the recipe exactly.

The authors recommend values $\nu = 0.1$ and $\eta \leq 1/2$, and call their combined
procedure (ensemble generation and post processing) Importance sampled
learning ensemble (ISLE).
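The ensemble-generation step of Algorithm 16.2 can be sketched as follows for one particular choice of ingredients. The choices are our assumptions, made to keep the example short: squared-error loss (so that step 2(a) is a least squares fit to the current residuals on the subsample), stumps as the base learner $b(x; \gamma)$, and sampling without replacement. All function names are ours.

```python
import random


def fit_stump(X, r, idx):
    """Least squares stump fit to residuals r on the rows listed in idx.

    Returns (feature, threshold, left_mean, right_mean).
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(X[i][j] for i in idx))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r[i] for i in idx if X[i][j] <= thr]
            right = [r[i] for i in idx if X[i][j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:                        # no split possible: constant fit
        mean = sum(r[i] for i in idx) / len(idx)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    j, thr, ml, mr = stump
    return ml if x[j] <= thr else mr


def isle_ensemble(X, y, M=50, eta=0.5, nu=0.1, seed=0):
    """Generate M stumps following the steps of Algorithm 16.2."""
    rng = random.Random(seed)
    n = len(y)
    f = [sum(y) / n] * n                    # step 1: best constant fit
    size = max(2, int(n * eta))             # subsample size N * eta
    ensemble = []
    for _ in range(M):                      # step 2
        idx = rng.sample(range(n), size)    # S_m(eta), without replacement
        resid = [y[i] - f[i] for i in range(n)]
        stump = fit_stump(X, resid, idx)    # 2(a) under squared-error loss
        ensemble.append(stump)
        f = [f[i] + nu * stump_predict(stump, X[i]) for i in range(n)]   # 2(b)
    return ensemble, f                      # step 3: the collected basis functions
```

Setting `nu=0` removes the memory, so each stump is fit to the same residuals on a fresh subsample, while `eta=1` with `nu > 0` gives ordinary gradient boosting with shrinkage for this loss. The returned stumps would then be passed to a lasso post-processor.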

Figure 16.7 shows the performance of an ISLE on the spam data. It does
not improve the predictive performance, but is able to produce a more
parsimonious model. Note that in practice the post-processing includes
the selection of the regularization parameter $\lambda$ in (16.9), which would be
chosen by cross-validation. Here we simply demonstrate the effects of
post-processing by showing the entire path on the test data.

[Figure 16.7: Spam Data.]

FIGURE 16.7. Importance sampling learning ensemble (ISLE) fit to the spam
data. Here we used $\eta = 1/2$, $\nu = 0.05$, and trees with five terminal nodes. The
lasso post-processed ensemble does not improve the prediction error in this case,
but it reduces the number of trees by a factor of five.

FIGURE 16.8. Demonstration of ensemble methods on a regression simulation
example. The notation GBM (0.1, 0.01) refers to a gradient boosted model, with
parameters $(\eta, \nu)$. We report mean-squared error from the true (known) function.
Note that the sub-sampled GBM model (green) outperforms the full GBM model
(orange). The lasso post-processed version achieves similar error. The random
forest is outperformed by its post-processed version, but both fall short of the
other models.

Figure 16.8 shows various ISLEs on a regression example. The generating
function is

$$f(X) = 10 \cdot \prod_{j=1}^{5} e^{-2X_j^2} + \sum_{j=6}^{35} X_j, \tag{16.13}$$

where $X \sim U[0, 1]^{100}$ (the last 65 elements are noise variables). The response
$Y = f(X) + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2)$; we chose $\sigma = 1.3$ resulting in a
signal-to-noise ratio of approximately 2. We used a training sample of size
1000, and estimated the mean squared error $E(\hat{f}(X) - f(X))^2$ by averaging
over a test set of 500 samples. The sub-sampled GBM curve (light blue)
is an instance of stochastic gradient boosting (Friedman, 1999) discussed in
Section 10.12, and it outperforms gradient boosting on this example.
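Data from this simulation model can be generated with a few lines of Python. The sketch below follows (16.13) with $X$ uniform on $[0,1]^{100}$ and Gaussian noise with $\sigma = 1.3$. The function names and the seed are ours, and `math.prod` requires Python 3.8 or later.

```python
import math
import random


def f_true(x):
    """Generating function (16.13); x is a sequence of at least 35 values."""
    product = math.prod(math.exp(-2.0 * x[j] ** 2) for j in range(5))
    return 10.0 * product + sum(x[j] for j in range(5, 35))


def simulate(n, sigma=1.3, p=100, seed=0):
    """Draw n observations: X uniform on [0,1]^p and Y = f(X) + Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [f_true(row) + rng.gauss(0.0, sigma) for row in X]
    return X, y


X_train, y_train = simulate(1000)           # training sample of size 1000
X_test, _ = simulate(500, seed=1)           # test inputs
f_test = [f_true(row) for row in X_test]    # noiseless targets for the MSE
```

Because the true function is known, the test error can be measured against `f_test` rather than against noisy responses, as is done in Figure 16.8.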
16.3.2 Rule Ensembles

Here we describe a modification of the tree-ensemble method that focuses 
on individual rules (Friedman and Popescu, 2003). We encountered rules 
in Section 9.3 in the discussion of the PRIM method. The idea is to enlarge 
an ensemble of trees by constructing a set of rules from each of the trees 
in the collection. 



FIGURE 16.9. A typical tree in an ensemble, from which rules can be derived. 


Figure 16.9 depicts a small tree, with numbered nodes. The following
rules can be derived from this tree:

$$\begin{aligned}
R_1(X) &= I(X_1 < 2.1) \\
R_2(X) &= I(X_1 \geq 2.1) \\
R_3(X) &= I(X_1 \geq 2.1) \cdot I(X_3 \in \{S\}) \\
R_4(X) &= I(X_1 \geq 2.1) \cdot I(X_3 \in \{M, L\}) \\
R_5(X) &= I(X_1 \geq 2.1) \cdot I(X_3 \in \{S\}) \cdot I(X_7 < 4.5) \\
R_6(X) &= I(X_1 \geq 2.1) \cdot I(X_3 \in \{S\}) \cdot I(X_7 \geq 4.5)
\end{aligned} \tag{16.14}$$


A linear expansion in rules 1, 4, 5 and 6 is equivalent to the tree itself 
(Exercise 16.3); hence (16.14) is an over-complete basis for the tree. 
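The rules of (16.14) are simply products of indicator functions, and the equivalence with the tree can be checked numerically. In the sketch below an observation is a dictionary with keys `X1`, `X3` and `X7`; the representation, the function names and the leaf values are ours, and we assume the right-hand branches include equality ($X_1 \geq 2.1$, $X_7 \geq 4.5$).

```python
def rules(x):
    """Evaluate R_1, ..., R_6 at x = {'X1': float, 'X3': 'S'|'M'|'L', 'X7': float}."""
    right = x["X1"] >= 2.1
    small = x["X3"] in {"S"}
    return [
        int(x["X1"] < 2.1),                          # R_1
        int(right),                                  # R_2
        int(right and small),                        # R_3
        int(right and x["X3"] in {"M", "L"}),        # R_4
        int(right and small and x["X7"] < 4.5),      # R_5
        int(right and small and x["X7"] >= 4.5),     # R_6
    ]


def tree_predict(x, leaf_values):
    """Tree output as a linear expansion in rules 1, 4, 5 and 6."""
    r = rules(x)
    return sum(v * r[i] for v, i in zip(leaf_values, (0, 3, 4, 5)))


# Hypothetical terminal-node values for the four leaves.
leaves = [10.0, 20.0, 30.0, 40.0]
x = {"X1": 3.0, "X3": "S", "X7": 5.0}
prediction = tree_predict(x, leaves)    # falls in the leaf for R_6
```

Rules 1, 4, 5 and 6 correspond to the terminal nodes, so exactly one of them equals one at any point. Rules 2 and 3 describe internal nodes and are the redundant members of the over-complete basis.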

For each tree $T_m$ in an ensemble $\mathcal{T}$, we can construct its mini-ensemble
of rules $\mathcal{T}^m_{RULE}$, and then combine them all to form a larger ensemble

$$\mathcal{T}_{RULE} = \bigcup_{m=1}^{M} \mathcal{T}^m_{RULE}. \tag{16.15}$$


This is then treated like any other ensemble, and post-processed via the 
lasso or similar regularized procedure. 

There are several advantages to this approach of deriving rules from the 
more complex trees: 

• The space of models is enlarged, and can lead to improved performance. 


FIGURE 16.10. Mean squared error for rule ensembles, using 20 realizations 
of the simulation example (16.13). 

• Rules are easier to interpret than trees, so there is the potential for 
a simplified model. 

• It is often natural to augment $\mathcal{T}_{\mathrm{RULE}}$ by including each variable $X_j$ 
separately as well, thus allowing the ensemble to model linear functions 
well. 

Friedman and Popescu (2008) demonstrate the power of this procedure on a 
number of illustrative examples, including the simulation example (16.13). 
Figure 16.10 shows boxplots of the mean-squared error from the true model 
for twenty realizations from this model. The models were all fit using the 
Rulefit software, available on the ESL homepage$^3$, which runs in an 
automatic mode. 

On the same training set as used in Figure 16.8, the rule-based model 
achieved a mean-squared error of 1.06. Although slightly worse than the 
best achieved in that figure, the results are not comparable because 
cross-validation was used here to select the final model. 


Bibliographic Notes 

As noted in the introduction, many of the new methods in machine learning 
have been dubbed “ensemble” methods. These include neural networks, 
boosting, bagging and random forests; Dietterich (2000a) gives a survey of 
tree-based ensemble methods. Neural networks (Chapter 11) are perhaps 
more deserving of the name, since they simultaneously learn the parameters 
of the hidden units (basis functions), along with how to combine them. 
Bishop (2006) discusses neural networks in some detail, along with the 
Bayesian perspective (MacKay, 1992; Neal, 1996). Support vector machines 
(Chapter 12) can also be regarded as an ensemble method; they perform 
$L_2$ regularized model fitting in high-dimensional feature spaces. Boosting 
and lasso exploit sparsity through $L_1$ regularization to overcome the 
high-dimensionality, while SVMs rely on the “kernel trick” characteristic of $L_2$ 
regularization. 

3 ESL homepage: www-stat.stanford.edu/ElemStatLearn 

C5.0 (Quinlan, 2004) is a commercial tree and rule generation package, 
with some goals in common with Rulefit. 

There is a vast and varied literature often referred to as “combining 
classifiers” which abounds in ad-hoc schemes for mixing methods of different 
types to achieve better performance. For a principled approach, see Kittler 
et al. (1998). 

Exercises 

Ex. 16.1 Describe exactly how to generate the block correlated data used 
in the simulation in Section 16.2.3. 

Ex. 16.2 Let $\alpha(t) \in \mathbb{R}^p$ be a piecewise-differentiable and continuous 
coefficient profile, with $\alpha(0) = 0$. The $L_1$ arc-length of $\alpha$ from time 0 to $t$ is 
defined by 

$$\Lambda(t) = \int_0^t |\dot{\alpha}(s)|_1 \, ds. \tag{16.16}$$

Show that $\Lambda(t) \geq |\alpha(t)|_1$, with equality iff $\alpha(t)$ is monotone. 

Ex. 16.3 Show that fitting a linear regression model using rules 1, 4, 5 and 
6 in equation (16.14) gives the same fit as the regression tree corresponding 
to this tree. Show the same is true for classification, if a logistic regression 
model is fit. 

Ex. 16.4 Program and run the simulation study described in Figure 16.2. 
17.1 Introduction 

A graph consists of a set of vertices (nodes), along with a set of edges 
joining some pairs of the vertices. In graphical models, each vertex represents 
a random variable, and the graph gives a visual way of understanding the 
joint distribution of the entire set of random variables. They can be 
useful for either unsupervised or supervised learning. In an undirected graph, 
the edges have no directional arrows. We restrict our discussion to 
undirected graphical models, also known as Markov random fields or Markov 
networks. In these graphs, the absence of an edge between two vertices has 
a special meaning: the corresponding random variables are conditionally 
independent, given the other variables. 

Figure 17.1 shows an example of a graphical model for a flow-cytometry 
dataset with p = 11 proteins measured on N = 7466 cells, from Sachs 
et al. (2005). Each vertex in the graph corresponds to the real-valued 
expression level of a protein. The network structure was estimated assuming 
a multivariate Gaussian distribution, using the graphical lasso procedure 
discussed later in this chapter. 

Sparse graphs have a relatively small number of edges, and are convenient 
for interpretation. They are useful in a variety of domains, including 
genomics and proteomics, where they provide rough models of cell pathways. 
Much work has been done in defining and understanding the structure of 
graphical models; see the Bibliographic Notes for references. 



FIGURE 17.1. Example of a sparse undirected graph, estimated from a flow 
cytometry dataset, with p = 11 proteins measured on N = 7466 cells. The 
network structure was estimated using the graphical lasso procedure discussed in this 
chapter. 


As we will see, the edges in a graph are parametrized by values or 
potentials that encode the strength of the conditional dependence between 
the random variables at the corresponding vertices. The main challenges in 
working with graphical models are model selection (choosing the structure 
of the graph), estimation of the edge parameters from data, and 
computation of marginal vertex probabilities and expectations, from their joint 
distribution. The last two tasks are sometimes called learning and inference 
in the computer science literature. 

We do not attempt a comprehensive treatment of this interesting area. 
Instead, we introduce some basic concepts, and then discuss a few 
simple methods for estimation of the parameters and structure of undirected 
graphical models; methods that relate to the techniques already discussed 
in this book. The estimation approaches that we present for continuous 
and discrete-valued vertices are different, so we treat them separately. 
Sections 17.3.1 and 17.3.2 may be of particular interest, as they describe new, 
regression-based procedures for estimating graphical models. 

There is a large and active literature on directed graphical models or 
Bayesian networks; these are graphical models in which the edges have 
directional arrows (but no directed cycles). Directed graphical models 
represent probability distributions that can be factored into products of 
conditional distributions, and have the potential for causal interpretations. We 
refer the reader to Wasserman (2004) for a brief overview of both 
undirected and directed graphs; the next section follows closely his Chapter 18. 

FIGURE 17.2. Examples of undirected graphical models or Markov networks. 
Each node or vertex represents a random variable, and the lack of an edge between 
two nodes indicates conditional independence. For example, in graph (a), X and 
Z are conditionally independent, given Y. In graph (b), Z is independent of each 
of X, Y, and W. 

A longer list of useful references is given in the Bibliographic Notes on 
page 645. 


17.2 Markov Graphs and Their Properties 

In this section we discuss the basic properties of graphs as models for the 
joint distribution of a set of random variables. We defer discussion of (a) 
parametrization and estimation of the edge parameters from data, and (b) 
estimation of the topology of a graph, to later sections. 

Figure 17.2 shows four examples of undirected graphs. A graph $\mathcal{G}$ consists 
of a pair $(V, E)$, where $V$ is a set of vertices and $E$ the set of edges (defined 
by pairs of vertices). Two vertices $X$ and $Y$ are called adjacent if there 
is an edge joining them; this is denoted by $X \sim Y$. A path $X_1, X_2, \ldots, X_n$ 
is a set of vertices that are joined, that is $X_{i-1} \sim X_i$ for $i = 2, \ldots, n$. A 
complete graph is a graph with every pair of vertices joined by an edge. 
A subgraph $U \subseteq V$ is a subset of vertices together with their edges. For 
example, $(X, Y, Z)$ in Figure 17.2(a) form a path but not a complete graph. 

Suppose that we have a graph $\mathcal{G}$ whose vertex set $V$ represents a set of 
random variables having joint distribution $P$. In a Markov graph $\mathcal{G}$, the 
absence of an edge implies that the corresponding random variables are 
conditionally independent given the variables at the other vertices. This is 
expressed with the following notation: 

$$\text{No edge joining } X \text{ and } Y \iff X \perp Y \mid \text{rest} \tag{17.1}$$

where “rest” refers to all of the other vertices in the graph. For example 
in Figure 17.2(a) $X \perp Z \mid Y$. These are known as the pairwise Markov 
independencies of $\mathcal{G}$. 

If A, B and C are subgraphs, then C is said to separate A and B if every 
path between A and B intersects a node in C. For example, Y separates 
X and Z in Figures 17.2(a) and (d), and Z separates Y and W in (d). In 
Figure 17.2(b) Z is not connected to X,Y,W so we say that the two sets 
are separated by the empty set. In Figure 17.2(c), $C = \{X, Z\}$ separates 
Y and W. 

Separators have the nice property that they break the graph into 
conditionally independent pieces. Specifically, in a Markov graph $\mathcal{G}$ with 
subgraphs $A$, $B$ and $C$, 

$$\text{if } C \text{ separates } A \text{ and } B \text{ then } A \perp B \mid C. \tag{17.2}$$

These are known as the global Markov properties of $\mathcal{G}$. It turns out that the 
pairwise and global Markov properties of a graph are equivalent (for graphs 
with positive distributions). That is, the set of graphs with associated 
probability distributions that satisfy the pairwise Markov independencies and 
global Markov assumptions are the same. This result is useful for inferring 
global independence relations from simple pairwise properties. For example 
in Figure 17.2(d) $X \perp Z \mid \{Y, W\}$ since it is a Markov graph and there is no 
link joining $X$ and $Z$. But $Y$ also separates $X$ from $Z$ and $W$ and hence by 
the global Markov assumption we conclude that $X \perp Z \mid Y$ and $X \perp W \mid Y$. 
Similarly we have $Y \perp W \mid Z$. 
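
Separation is a purely graph-theoretic property and is easy to check by search. The following sketch uses only the Python standard library; the function name `separates` is hypothetical, and the edge lists are our reading of the graphs in Figure 17.2 as implied by the surrounding text (the figure itself is not reproduced here).

```python
from collections import deque

def separates(edges, A, B, C):
    """Return True if every path from a vertex in A to a vertex in B meets C."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    blocked = set(C)
    targets = set(B)
    start = set(A) - blocked
    seen = set(start)
    queue = deque(start)
    # Breadth-first search from A that is not allowed to enter C.
    while queue:
        u = queue.popleft()
        if u in targets:
            return False
        for v in adj.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Edge lists assumed for the graphs of Figure 17.2.
graph_a = [("X", "Y"), ("Y", "Z")]
graph_b = [("X", "Y"), ("Y", "W"), ("X", "W")]          # Z is isolated
graph_c = [("X", "Y"), ("Y", "Z"), ("Z", "W"), ("W", "X")]
graph_d = [("X", "Y"), ("Y", "Z"), ("Z", "W")]
```

For instance, `separates(graph_c, {"Y"}, {"W"}, {"X", "Z"})` holds, whereas conditioning on `{"X"}` alone leaves the path through Z open.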

The global Markov property allows us to decompose graphs into smaller 
more manageable pieces and thus leads to essential simplifications in 
computation and interpretation. For this purpose we separate the graph into 
cliques. A clique is a complete subgraph, that is, a set of vertices that are all 
adjacent to one another; it is called maximal if it is a clique and no other 
vertices can be added to it and still yield a clique. The maximal cliques for 
the graphs of Figure 17.2 are 

(a) $\{X, Y\}, \{Y, Z\}$, 

(b) $\{X, Y, W\}, \{Z\}$, 

(c) $\{X, Y\}, \{Y, Z\}, \{Z, W\}, \{X, W\}$, and 

(d) $\{X, Y\}, \{Y, Z\}, \{Z, W\}$. 
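
The maximal cliques listed above can be enumerated mechanically. A minimal sketch follows, using the basic Bron-Kerbosch recursion (our choice of algorithm, not one named in the text) and edge lists for graphs (b), (c) and (d) that are inferred from the clique lists themselves.

```python
def maximal_cliques(vertices, edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(R, P, X):
        # R: current clique; P: candidates to extend it; X: already-explored vertices.
        if not P and not X:
            cliques.append(frozenset(R))
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(vertices), set())
    return set(cliques)

# Assumed edge lists for Figure 17.2 (b), (c) and (d).
graph_b = [("X", "Y"), ("Y", "W"), ("X", "W")]
graph_c = [("X", "Y"), ("Y", "Z"), ("Z", "W"), ("W", "X")]
graph_d = [("X", "Y"), ("Y", "Z"), ("Z", "W")]

cliques_b = maximal_cliques("XYZW", graph_b)
cliques_c = maximal_cliques("XYZW", graph_c)
cliques_d = maximal_cliques("XYZW", graph_d)
```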


Although the following applies to both continuous and discrete 
distributions, much of the development has been for the latter. A probability 
density function $f$ over a Markov graph $\mathcal{G}$ can be represented as 

$$f(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C) \tag{17.3}$$

where $\mathcal{C}$ is the set of maximal cliques, and the positive functions $\psi_C(\cdot)$ are 
called clique potentials. These are not in general density functions$^1$, but 
rather are affinities that capture the dependence in $X_C$ by scoring certain 
instances $x_C$ higher than others. The quantity 

$$Z = \sum_{x \in \mathcal{X}} \prod_{C \in \mathcal{C}} \psi_C(x_C) \tag{17.4}$$

is the normalizing constant, also known as the partition function. 
Alternatively, the representation (17.3) implies a graph with independence 
properties defined by the cliques in the product. This result holds for Markov 
networks $\mathcal{G}$ with positive distributions, and is known as the 
Hammersley-Clifford theorem (Hammersley and Clifford, 1971; Clifford, 1990). 

Many of the methods for estimation and computation on graphs first 
decompose the graph into its maximal cliques. Relevant quantities are 
computed in the individual cliques and then accumulated across the entire 
graph. A prominent example is the join tree or junction tree algorithm for 
computing marginal and low order probabilities from the joint distribution 
on a graph. Details can be found in Pearl (1986), Lauritzen and 
Spiegelhalter (1988), Pearl (1988), Shenoy and Shafer (1988), Jensen et al. (1990), 
or Koller and Friedman (2007). 



FIGURE 17.3. A complete graph does not uniquely specify the higher-order 
dependence structure in the joint distribution of the variables. 

A graphical model does not always uniquely specify the higher-order 
dependence structure of a joint probability distribution. Consider the 
complete three-node graph in Figure 17.3. It could represent the dependence 
structure of either of the following distributions: 

$$\begin{aligned}
f^{(2)}(x, y, z) &= \frac{1}{Z} \psi(x, y)\psi(x, z)\psi(y, z); \\
f^{(3)}(x, y, z) &= \frac{1}{Z} \psi(x, y, z).
\end{aligned} \tag{17.5}$$

The first specifies only second order dependence (and can be represented 
with fewer parameters). Graphical models for discrete data are a special 
case of loglinear models for multiway contingency tables (Bishop et al., 
1975, e.g.); in that language $f^{(2)}$ is referred to as the “no second-order 
interaction” model. 

1 If the cliques are separated, then the potentials can be densities, but this is in general 
not the case. 

For the remainder of this chapter we focus on pairwise Markov graphs 
(Koller and Friedman, 2007). Here there is a potential function for each 
edge (pair of variables as in $f^{(2)}$ above), and at most second-order 
interactions are represented. These are more parsimonious in terms of parameters, 
easier to work with, and give the minimal complexity implied by the graph 
structure. The models for both continuous and discrete data are functions 
of only the pairwise marginal distributions of the variables represented in 
the edge set. 


17.3 Undirected Graphical Models for Continuous 
Variables 


Here we consider Markov networks where all the variables are continuous. 
The Gaussian distribution is almost always used for such graphical models, 
because of its convenient analytical properties. We assume that the 
observations have a multivariate Gaussian distribution with mean $\mu$ and covariance 
matrix $\Sigma$. Since the Gaussian distribution represents at most second-order 
relationships, it automatically encodes a pairwise Markov graph. The graph 
in Figure 17.1 is an example of a Gaussian graphical model. 

The Gaussian distribution has the property that all conditional 
distributions are also Gaussian. The inverse covariance matrix $\Sigma^{-1}$ contains 
information about the partial covariances between the variables; that is, 
the covariances between pairs $i$ and $j$, conditioned on all other variables. 
In particular, if the $ij$th component of $\Theta = \Sigma^{-1}$ is zero, then variables $i$ and 
$j$ are conditionally independent, given the other variables (Exercise 17.3). 

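It is instructive to examine the conditional distribution of one variable 
versus the rest, where the role of $\Theta$ is explicit. Suppose we partition $X = 
(Z, Y)$ where $Z = (X_1, \ldots, X_{p-1})$ consists of the first $p - 1$ variables and 
$Y = X_p$ is the last. Then we have the conditional distribution of $Y$ given $Z$ 
(Mardia et al., 1979, e.g.) 

$$Y \mid Z = z \sim N\left(\mu_Y + (z - \mu_Z)^T \Sigma_{ZZ}^{-1} \sigma_{ZY},\; \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY}\right), \tag{17.6}$$

where we have partitioned $\Sigma$ as 

$$\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \sigma_{ZY} \\ \sigma_{ZY}^T & \sigma_{YY} \end{pmatrix}. \tag{17.7}$$

The conditional mean in (17.6) has exactly the same form as the 
population multiple linear regression of $Y$ on $Z$, with regression coefficient 
$\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY}$ [see (2.16) on page 19]. If we partition $\Theta$ in the same way, 
since $\Sigma\Theta = I$ standard formulas for partitioned inverses give 

$$\theta_{ZY} = -\theta_{YY} \cdot \Sigma_{ZZ}^{-1} \sigma_{ZY}, \tag{17.8}$$

where $1/\theta_{YY} = \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} > 0$. Hence 

$$\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY} = -\theta_{ZY}/\theta_{YY}. \tag{17.9}$$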


We have learned two things here: 

• The dependence of $Y$ on $Z$ in (17.6) is in the mean term alone. Here 
we see explicitly that zero elements in $\beta$ and hence $\theta_{ZY}$ mean that 
the corresponding elements of $Z$ are conditionally independent of $Y$, 
given the rest. 

• We can learn about this dependence structure through multiple linear 
regression. 

Thus $\Theta$ captures all the second-order information (both structural and 
quantitative) needed to describe the conditional distribution of each node 
given the rest, and is the so-called “natural” parameter for the Gaussian 
graphical model$^2$. 
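
The link between $\Theta$ and regression in (17.9) is easy to verify numerically. The sketch below assumes NumPy and uses a hypothetical tridiagonal precision matrix (a chain $X_1 - X_2 - X_3 - X_4$) chosen only for illustration.

```python
import numpy as np

# Hypothetical precision matrix for a chain graph X1 - X2 - X3 - X4.
Theta = np.array([[ 2.0, -1.0,  0.0,  0.0],
                  [-1.0,  2.0, -1.0,  0.0],
                  [ 0.0, -1.0,  2.0, -1.0],
                  [ 0.0,  0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Theta)

# Partition as in (17.7): Z = (X1, X2, X3), Y = X4.
Sigma_ZZ = Sigma[:3, :3]
sigma_ZY = Sigma[:3, 3]
sigma_YY = Sigma[3, 3]

# Population regression coefficient of Y on Z, computed from Sigma ...
beta = np.linalg.solve(Sigma_ZZ, sigma_ZY)
# ... and from Theta via (17.9).
beta_from_theta = -Theta[:3, 3] / Theta[3, 3]

# Conditional variance of Y given Z, which should equal 1 / theta_YY.
cond_var = sigma_YY - sigma_ZY @ beta
```

Here `beta` comes out as (0, 0, 0.5): the zeros match the zero entries of $\theta_{ZY}$, reflecting that $X_4$ is conditionally independent of $X_1$ and $X_2$ given $X_3$.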

Another (different) kind of graphical model is the covariance graph or 
relevance network, in which vertices are connected by bidirectional edges if the 
covariance (rather than the partial covariance) between the corresponding 
variables is nonzero. These are popular in genomics, see especially Butte 
et al. (2000). The negative log-likelihood from these models is not convex, 
making the computations more challenging (Chaudhuri et al., 2007). 

17.3.1 Estimation of the Parameters when the Graph 
Structure is Known 

Given some realizations of $X$, we would like to estimate the parameters 
of an undirected graph that approximates their joint distribution. Suppose 
first that the graph is complete (fully connected). We assume that we have 
$N$ multivariate normal realizations $x_i$, $i = 1, \ldots, N$ with population mean 
$\mu$ and covariance $\Sigma$. Let 

$$S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \tag{17.10}$$

be the empirical covariance matrix, with $\bar{x}$ the sample mean vector. Ignoring 
constants, the log-likelihood of the data can be written as 

$$\ell(\Theta) = \log\det \Theta - \operatorname{trace}(S\Theta). \tag{17.11}$$

2 The distribution arising from a Gaussian graphical model is a Wishart distribution. 
This is a member of the exponential family, with canonical or “natural” parameter 
$\Theta = \Sigma^{-1}$. Indeed, the partially maximized log-likelihood (17.11) is (up to constants) 
the Wishart log-likelihood. 

In (17.11) we have partially maximized with respect to the mean parameter 
$\mu$. The quantity $-\ell(\Theta)$ is a convex function of $\Theta$. It is easy to show that 
the maximum likelihood estimate of $\Sigma$ is simply $S$. 
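
The last claim follows in one line from the fact that the derivative of $\log\det\Theta$ with respect to $\Theta$ is $\Theta^{-1}$; a brief sketch, assuming $S$ is nonsingular:

$$\frac{\partial \ell(\Theta)}{\partial \Theta} = \Theta^{-1} - S = 0 \;\Longrightarrow\; \hat{\Theta} = S^{-1}, \qquad \hat{\Sigma} = \hat{\Theta}^{-1} = S.$$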

Now to make the graph more useful (especially in high-dimensional 
settings) let’s assume that some of the edges are missing; for example, the 
edge between PIP3 and Erk is one of several missing in Figure 17.1. As we 
have seen, for the Gaussian distribution this implies that the 
corresponding entries of $\Theta = \Sigma^{-1}$ are zero. Hence we now would like to maximize 
(17.11) under the constraints that some pre-defined subset of the 
parameters are zero. This is an equality-constrained convex optimization problem, 
and a number of methods have been proposed for solving it, in particular 
the iterative proportional fitting procedure (Speed and Kiiveri, 1986). This 
and other methods are summarized for example in Whittaker (1990) and 
Lauritzen (1996). These methods exploit the simplifications that arise from 
decomposing the graph into its maximal cliques, as described in the 
previous section. Here we outline a simple alternate approach, that exploits the 
sparsity in a different way. The fruits of this approach will become apparent 
later when we discuss the problem of estimation of the graph structure. 

The idea is based on linear regression, as inspired by (17.6) and (17.9). 
In particular, suppose that we want to estimate the edge parameters $\theta_{ij}$ for 
the vertices that are joined to a given vertex i, restricting those that are not 
joined to be zero. Then it would seem that the linear regression of the node 
i values on the other relevant vertices might provide a reasonable estimate. 
But this ignores the dependence structure among the predictors in this 
regression. It turns out that if instead we use our current (model-based) 
estimate of the cross-product matrix of the predictors when we perform 
our regressions, this gives the correct solutions and solves the constrained 
maximum-likelihood problem exactly. We now give details. 

To constrain the log-likelihood (17.11), we add Lagrange constants for 
all missing edges 

$$\ell_C(\Theta) = \log\det \Theta - \operatorname{trace}(S\Theta) - \sum_{(j,k) \notin E} \gamma_{jk}\theta_{jk}. \tag{17.12}$$

The gradient equation for maximizing (17.12) can be written as 

$$\Theta^{-1} - S - \Gamma = 0, \tag{17.13}$$

using the fact that the derivative of $\log\det \Theta$ equals $\Theta^{-1}$ (Boyd and 
Vandenberghe, 2004, for example, page 641). $\Gamma$ is a matrix of Lagrange 
parameters with nonzero values for all pairs with edges absent. 

We will show how we can use regression to solve for $\Theta$ and its inverse 
$W = \Theta^{-1}$ one row and column at a time. For simplicity let’s focus on the 
last row and column. Then the upper right block of equation (17.13) can 
be written as 

$$w_{12} - s_{12} - \gamma_{12} = 0. \tag{17.14}$$

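Here we have partitioned the matrices into two parts as in (17.7): part 1 
being the first $p - 1$ rows and columns, and part 2 the $p$th row and column. 
With $W$ and its inverse $\Theta$ partitioned in a similar fashion, we have 

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}
\begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix}. \tag{17.15}$$

This implies 

$$\begin{aligned}
w_{12} &= -W_{11}\theta_{12}/\theta_{22} && (17.16) \\
       &= W_{11}\beta && (17.17)
\end{aligned}$$

where $\beta = -\theta_{12}/\theta_{22}$ as in (17.9). Now substituting (17.17) into (17.14) 
gives 

$$W_{11}\beta - s_{12} - \gamma_{12} = 0. \tag{17.18}$$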

These can be interpreted as the $p - 1$ estimating equations for the 
constrained regression of $X_p$ on the other predictors, except that the observed 
mean cross-products matrix $S_{11}$ is replaced by $W_{11}$, the current estimated 
covariance matrix from the model. 

Now we can solve (17.18) by simple subset regression. Suppose there are 
$p - q$ nonzero elements in $\gamma_{12}$, i.e., $p - q$ edges constrained to be zero. These 
$p - q$ rows carry no information and can be removed. Furthermore we can 
reduce $\beta$ to $\beta^*$ by removing its $p - q$ zero elements, yielding the reduced 
$q \times q$ system of equations 

$$W_{11}^* \beta^* - s_{12}^* = 0, \tag{17.19}$$

with solution $\hat{\beta}^* = {W_{11}^*}^{-1} s_{12}^*$. This is padded with $p - q$ zeros to give $\hat{\beta}$. 

Although it appears from (17.16) that we only recover the elements $\theta_{12}$ 
up to a scale factor $1/\theta_{22}$, it is easy to show that 

$$\frac{1}{\theta_{22}} = w_{22} - w_{12}^T \beta \tag{17.20}$$

(using partitioned inverse formulas). Also $w_{22} = s_{22}$, since the diagonal of 
$\Gamma$ in (17.13) is zero. 

This leads to the simple iterative procedure given in Algorithm 17.1 for 
estimating both $W$ and its inverse $\Theta$, subject to the constraints of the 
missing edges. 

Note that this algorithm makes conceptual sense. The graph estimation 
problem is not $p$ separate regression problems, but rather $p$ coupled 
problems. The use of the common $W$ in step (b), in place of the observed 
cross-products matrix, couples the problems together in the appropriate 
fashion. Surprisingly, we were not able to find this procedure in the 
literature. However it is related to the covariance selection procedures of 
Dempster (1972), and is similar in flavor to the iterative conditional fitting 
procedure for covariance graphs, proposed by Chaudhuri et al. (2007). 

Algorithm 17.1 A Modified Regression Algorithm for Estimation of an 
Undirected Gaussian Graphical Model with Known Structure. 

1. Initialize $W = S$. 

2. Repeat for $j = 1, 2, \ldots, p$ until convergence: 

(a) Partition the matrix $W$ into part 1: all but the $j$th row and 
column, and part 2: the $j$th row and column. 

(b) Solve $W_{11}^* \beta^* - s_{12}^* = 0$ for the unconstrained edge parameters 
$\beta^*$, using the reduced system of equations as in (17.19). Obtain 
$\hat{\beta}$ by padding $\hat{\beta}^*$ with zeros in the appropriate positions. 

(c) Update $w_{12} = W_{11}\hat{\beta}$. 

3. In the final cycle (for each $j$) solve for $\hat{\theta}_{12} = -\hat{\beta} \cdot \hat{\theta}_{22}$, with $1/\hat{\theta}_{22} = 
s_{22} - w_{12}^T \hat{\beta}$. 

$$S = \begin{pmatrix} 10 & 1 & 5 & 4 \\ 1 & 10 & 2 & 6 \\ 5 & 2 & 10 & 3 \\ 4 & 6 & 3 & 10 \end{pmatrix}$$

FIGURE 17.4. A simple graph for illustration, along with the empirical 
covariance matrix. 

Here is a little example, borrowed from Whittaker (1990). Suppose that 
our model is as depicted in Figure 17.4, along with its empirical covariance 
matrix $S$. We apply Algorithm 17.1 to this problem; for example, in the 
modified regression for variable 1 in step (b), variable 3 is left out. The 
procedure quickly converged to the solutions: 

$$\hat{\Sigma} = \begin{pmatrix}
10.00 & 1.00 & 1.31 & 4.00 \\
1.00 & 10.00 & 2.00 & 0.87 \\
1.31 & 2.00 & 10.00 & 3.00 \\
4.00 & 0.87 & 3.00 & 10.00
\end{pmatrix}, \qquad
\hat{\Sigma}^{-1} = \begin{pmatrix}
0.12 & -0.01 & 0.00 & -0.05 \\
-0.01 & 0.11 & -0.02 & 0.00 \\
0.00 & -0.02 & 0.11 & -0.03 \\
-0.05 & 0.00 & -0.03 & 0.13
\end{pmatrix}.$$

Note the zeroes in $\hat{\Sigma}^{-1}$, corresponding to the missing edges (1,3) and (2,4). 
Note also that the corresponding elements in $\hat{\Sigma}$ are the only elements 
different from $S$. The estimation of $\hat{\Sigma}$ is an example of what is sometimes 
called the positive definite “completion” of $S$. 
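
The example can be reproduced with a short program. This is a minimal sketch of Algorithm 17.1 assuming NumPy; the function name `fit_known_graph`, the boolean adjacency-matrix encoding of the graph (edges 1-2, 2-3, 3-4 and 4-1, i.e. all pairs except the two missing edges named above) and the final matrix inversion used in place of step 3 are our own illustrative choices.

```python
import numpy as np

def fit_known_graph(S, adj, n_sweeps=200, tol=1e-10):
    """Modified regression algorithm for a Gaussian graph with known edges."""
    p = S.shape[0]
    W = S.astype(float).copy()                     # step 1: W = S
    for _ in range(n_sweeps):
        W_old = W.copy()
        for j in range(p):                         # step 2: cycle over nodes
            rest = [k for k in range(p) if k != j]
            nbrs = [k for k in rest if adj[j, k]]  # nodes joined to j
            beta = np.zeros(p)
            if nbrs:
                # (b) reduced system W*_11 beta* = s*_12, padded with zeros
                beta[nbrs] = np.linalg.solve(W[np.ix_(nbrs, nbrs)], S[nbrs, j])
            # (c) update w_12 = W_11 beta
            w12 = W[np.ix_(rest, rest)] @ beta[rest]
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol:
            break
    # Shortcut replacing step 3: invert the converged W directly.
    return W, np.linalg.inv(W)

S = np.array([[10.0, 1, 5, 4],
              [1, 10.0, 2, 6],
              [5, 2, 10.0, 3],
              [4, 6, 3, 10.0]])
adj = np.zeros((4, 4), dtype=bool)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 0)]:      # zero-based edge list
    adj[a, b] = adj[b, a] = True

W, Theta = fit_known_graph(S, adj)
```

In this sketch `W` agrees with `S` on the diagonal and on the edges, and differs only in the (1,3) and (2,4) entries, where `Theta` is zero up to numerical tolerance.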
17.3.2 Estimation of the Graph Structure 

In most cases we do not know which edges to omit from our graph, and 
so would like to try to discover this from the data itself. In recent years a 
number of authors have proposed the use of $L_1$ (lasso) regularization for 
this purpose. 

Meinshausen and Bühlmann (2006) take a simple approach to the 
problem: rather than trying to fully estimate $\Sigma$ or $\Theta = \Sigma^{-1}$, they only estimate 
which components of $\theta_{ij}$ are nonzero. To do this, they fit a lasso regression 
using each variable as the response and the others as predictors. The 
component $\theta_{ij}$ is then estimated to be nonzero if either the estimated coefficient 
of variable $i$ on $j$ is nonzero, OR the estimated coefficient of variable $j$ on 
$i$ is nonzero (alternatively they use an AND rule). They show that 
asymptotically this procedure consistently estimates the set of nonzero elements 
of $\Theta$. 

We can take a more systematic approach with the lasso penalty, following 
the development of the previous section. Consider maximizing the penalized 
log-likelihood 

$$\log\det \Theta - \operatorname{trace}(S\Theta) - \lambda\|\Theta\|_1, \tag{17.21}$$

where $\|\Theta\|_1$ is the $L_1$ norm, the sum of the absolute values of the elements 
of $\Sigma^{-1}$, and we have ignored constants. The negative of this penalized 
likelihood is a convex function of $\Theta$. 

It turns out that one can adapt the lasso to give the exact maximizer of 
the penalized log-likelihood. In particular, we simply replace the modified 
regression step (b) in Algorithm 17.1 by a modified lasso step. Here are the 
details. 

The analog of the gradient equation (17.13) is now 

$$\Theta^{-1} - S - \lambda \cdot \mathrm{Sign}(\Theta) = 0. \tag{17.22}$$

Here we use sub-gradient notation, with $\mathrm{Sign}(\theta_{jk}) = \mathrm{sign}(\theta_{jk})$ if $\theta_{jk} \neq 0$, 
else $\mathrm{Sign}(\theta_{jk}) \in [-1, 1]$ if $\theta_{jk} = 0$. Continuing the development in the 
previous section, we reach the analog of (17.18) 

$$W_{11}\beta - s_{12} + \lambda \cdot \mathrm{Sign}(\beta) = 0 \tag{17.23}$$

(recall that $\beta$ and $\theta_{12}$ have opposite signs). We will now see that this system 
is exactly equivalent to the estimating equations for a lasso regression. 

Consider the usual regression setup with outcome variables $y$ and 
predictor matrix $Z$. There the lasso minimizes 

$$\tfrac{1}{2}(y - Z\beta)^T(y - Z\beta) + \lambda \cdot \|\beta\|_1 \tag{17.24}$$

[see (3.52) on page 68; here we have added a factor $\tfrac{1}{2}$ for convenience]. The 
gradient of this expression is 

$$Z^T Z\beta - Z^T y + \lambda \cdot \mathrm{Sign}(\beta) = 0 \tag{17.25}$$

So up to a factor $1/N$, $Z^T y$ is the analog of $s_{12}$, and we replace $Z^T Z$ by 
$W_{11}$, the estimated cross-product matrix from our current model. 

The resulting procedure is called the graphical lasso, proposed by 
Friedman et al. (2008b) building on the work of Banerjee et al. (2008). It is 
summarized in Algorithm 17.2. 

Algorithm 17.2 Graphical Lasso. 

1. Initialize $W = S + \lambda I$. The diagonal of $W$ remains unchanged in 
what follows. 

2. Repeat for $j = 1, 2, \ldots, p, 1, 2, \ldots, p, \ldots$ until convergence: 

(a) Partition the matrix $W$ into part 1: all but the $j$th row and 
column, and part 2: the $j$th row and column. 

(b) Solve the estimating equations $W_{11}\beta - s_{12} + \lambda \cdot \mathrm{Sign}(\beta) = 0$ 
using the cyclical coordinate-descent algorithm (17.26) for the 
modified lasso. 

(c) Update $w_{12} = W_{11}\hat{\beta}$. 

3. In the final cycle (for each $j$) solve for $\hat{\theta}_{12} = -\hat{\beta} \cdot \hat{\theta}_{22}$, with $1/\hat{\theta}_{22} = 
w_{22} - w_{12}^T \hat{\beta}$. 

Friedman et al. (2008b) use the pathwise coordinate descent method 
(Section 3.8.6) to solve the modified lasso problem at each stage. Here are 
the details of pathwise coordinate descent for the graphical lasso algorithm. 
Letting $V = W_{11}$, the update has the form 

$$\hat{\beta}_j \leftarrow S\Big(s_{12j} - \sum_{k \neq j} V_{kj}\hat{\beta}_k,\; \lambda\Big) \Big/ V_{jj} \tag{17.26}$$

for $j = 1, 2, \ldots, p - 1, 1, 2, \ldots, p - 1, \ldots$, where $S$ is the soft-threshold 
operator: 

$$S(x, t) = \mathrm{sign}(x)(|x| - t)_+. \tag{17.27}$$

The procedure cycles through the predictors until convergence. 

It is easy to show that the diagonal elements $w_{jj}$ of the solution matrix 
$W$ are simply $s_{jj} + \lambda$, and these are fixed in step 1 of Algorithm 17.2$^3$. 
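
Algorithm 17.2 together with the updates (17.26) and (17.27) can be sketched compactly. The code below assumes NumPy and is an illustrative implementation only: the function names, the warm-start storage `B`, the iteration limits and tolerances are our own choices, and the small matrix `S` is used purely as a convenient test input.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-threshold operator S(x, t) = sign(x) (|x| - t)_+ of (17.27)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def graphical_lasso(S, lam, n_sweeps=500, n_inner=500, tol=1e-8):
    p = S.shape[0]
    W = S.astype(float) + lam * np.eye(p)          # step 1
    B = np.zeros((p, p))                           # column j holds beta for node j
    for _sweep in range(n_sweeps):
        W_old = W.copy()
        for j in range(p):                         # step 2
            rest = [k for k in range(p) if k != j]
            V = W[np.ix_(rest, rest)]              # (a) V = W_11
            s12 = S[rest, j]
            beta = B[rest, j].copy()               # warm start
            for _inner in range(n_inner):          # (b) coordinate descent (17.26)
                beta_old = beta.copy()
                for k in range(p - 1):
                    partial = s12[k] - V[k] @ beta + V[k, k] * beta[k]
                    beta[k] = soft_threshold(partial, lam) / V[k, k]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[rest, j] = beta
            w12 = V @ beta                         # (c) w_12 = W_11 beta
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol:
            break
    Theta = np.zeros((p, p))                       # step 3
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        theta22 = 1.0 / (W[j, j] - W[rest, j] @ B[rest, j])
        Theta[j, j] = theta22
        Theta[rest, j] = -B[rest, j] * theta22
    return W, Theta

S = np.array([[10.0, 1, 5, 4],
              [1, 10.0, 2, 6],
              [5, 2, 10.0, 3],
              [4, 6, 3, 10.0]])
W_small, Theta_small = graphical_lasso(S, lam=1.0)
W_large, Theta_large = graphical_lasso(S, lam=10.0)
```

With the penalty larger than every off-diagonal element of `S`, all coefficients are thresholded to zero and the fitted graph has no edges; with the smaller penalty, the off-diagonal elements of `W` stay within `lam` of those of `S`, as (17.22) requires.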

The graphical lasso algorithm is extremely fast, and can solve a 
moderately sparse problem with 1000 nodes in less than a minute. It is easy to 
modify the algorithm to have edge-specific penalty parameters $\lambda_{jk}$; since 
$\lambda_{jk} = \infty$ will force $\hat{\theta}_{jk}$ to be zero, this algorithm subsumes Algorithm 17.1. 
By casting the sparse inverse-covariance problem as a series of regressions, 
one can also quickly compute and examine the solution paths as a function 
of the penalty parameter $\lambda$. More details can be found in Friedman et al. 
(2008b). 

3 An alternative formulation of the problem (17.21) can be posed, where we don’t 
penalize the diagonal of $\Theta$. Then the diagonal elements $w_{jj}$ of the solution matrix are 
$s_{jj}$, and the rest of the algorithm is unchanged. 


[Figure panels include $\lambda = 36$ and $\lambda = 27$.] 

FIGURE 17.5. Four different graphical-lasso solutions for the flow-cytometry 
data. 

Figure 17.1 shows the result of applying the graphical lasso to the 
flow-cytometry dataset. Here the lasso penalty parameter $\lambda$ was set at 14. In 
practice it is informative to examine the different sets of graphs that are 
obtained as $\lambda$ is varied. Figure 17.5 shows four different solutions. The 
graph becomes more sparse as the penalty parameter is increased. 

Finally note that the values at some of the nodes in a graphical model can 
be unobserved; that is, missing or hidden. If only some values are missing 
at a node, the EM algorithm can be used to impute the missing values 
(Exercise 17.9). However, sometimes the entire node is hidden or latent. 
In the Gaussian model, if a node has all missing values, due to linearity 
one can simply average over the missing nodes to yield another Gaussian 
model over the observed nodes. Hence the inclusion of hidden nodes does 
not enrich the resulting model for the observed nodes; in fact, it imposes 
additional structure on its covariance matrix. However in the discrete model 
(described next) the inherent nonlinearities make hidden units a powerful 
way of expanding the model. 

17.4 Undirected Graphical Models for Discrete 
Variables 

Undirected Markov networks with all discrete variables are popular, with 
pairwise Markov networks with binary variables being the 
most common. They are sometimes called Ising models in the statistical 
mechanics literature, and Boltzmann machines in the machine learning 
literature, where the vertices are referred to as “nodes” or “units” and are 
binary-valued. 

In addition, the values at each node can be observed (“visible”) or 
unobserved (“hidden”). The nodes are often organized in layers, similar to a 
neural network. Boltzmann machines are useful both for unsupervised and 
supervised learning, especially for structured input data such as images, 
but have been hampered by computational difficulties. Figure 17.6 shows 
a restricted Boltzmann machine (discussed later), in which some variables 
are hidden, and only some pairs of nodes are connected. We first consider 
the simpler case in which all $p$ nodes are visible with edge pairs $(j, k)$ 
enumerated in $E$. 

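Denoting the binary valued variable at node $j$ by $X_j$, the Ising model 
for their joint probabilities is given by 

$$p(X, \Theta) = \exp\Big[\sum_{(j,k) \in E} \theta_{jk} X_j X_k - \Phi(\Theta)\Big] \quad \text{for } X \in \mathcal{X}, \tag{17.28}$$

with $\mathcal{X} = \{0, 1\}^p$. As with the Gaussian model of the previous section, 
only pairwise interactions are modeled. The Ising model was developed in 
statistical mechanics, and is now used more generally to model the joint 
effects of pairwise interactions. $\Phi(\Theta)$ is the log of the partition function, 
and is defined by 

$$\Phi(\Theta) = \log \sum_{x \in \mathcal{X}} \Big[\exp\Big(\sum_{(j,k) \in E} \theta_{jk} x_j x_k\Big)\Big]. \tag{17.29}$$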

The partition function ensures that the probabilities add to one over the
sample space. The terms $\theta_{jk} X_j X_k$ represent a particular parametrization
of the (log) potential functions (17.5), and for technical reasons require
a constant node $X_0 = 1$ to be included (Exercise 17.10), with “edges” to
all the other nodes. In the statistics literature, this model is equivalent
to a first-order-interaction Poisson log-linear model for multiway tables of
counts (Bishop et al., 1975; McCullagh and Nelder, 1989; Agresti, 2002).
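
For a handful of nodes, (17.28) and (17.29) can be evaluated directly by enumerating all binary configurations. The following is a minimal sketch, not taken from the text: the function names, the dictionary layout for the parameters and the numerical values are all hypothetical, and node 0 plays the role of the constant node.

```python
import itertools
import math

def log_partition(theta, p):
    """Phi(Theta) of (17.29): log of the sum over all 2^p binary states.

    theta maps an edge (j, k) to theta_jk. Node 0 is the constant node
    X_0 = 1, so the edge (0, j) carries the main effect of node j.
    """
    total = 0.0
    for x in itertools.product((0, 1), repeat=p):
        state = (1,) + x                      # prepend the constant node
        total += math.exp(sum(t * state[j] * state[k]
                              for (j, k), t in theta.items()))
    return math.log(total)

def ising_prob(x, theta, phi):
    """p(x, Theta) of (17.28) for one binary configuration x."""
    state = (1,) + tuple(x)
    energy = sum(t * state[j] * state[k] for (j, k), t in theta.items())
    return math.exp(energy - phi)

# Hypothetical 3-node chain 1 - 2 - 3 with illustrative parameter values.
theta = {(0, 1): -0.5, (0, 2): 0.2, (0, 3): 0.1, (1, 2): 1.0, (2, 3): -0.8}
p = 3
phi = log_partition(theta, p)
probs = {x: ising_prob(x, theta, phi)
         for x in itertools.product((0, 1), repeat=p)}
```

The eight probabilities sum to one, which is exactly the job of the partition function. The enumeration has $2^p$ terms, so this approach only makes sense for small p.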

The Ising model implies a logistic form for each node conditional on the
others (Exercise 17.11):

$$\Pr(X_j = 1 \mid X_{-j} = x_{-j}) = \frac{1}{1 + \exp\big( -\theta_{j0} - \sum_{(j,k) \in E} \theta_{jk} x_k \big)}, \qquad (17.30)$$

where $X_{-j}$ denotes all of the nodes except j. Hence the parameter $\theta_{jk}$
measures the dependence of $X_j$ on $X_k$, conditional on the other nodes.
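
The logistic form can be checked numerically on a small model by comparing the conditional probability computed from the joint distribution with the closed form (17.30). This is only an illustrative sketch: the parameter values and function names are hypothetical, and node 0 is the constant node.

```python
import itertools
import math

# Hypothetical parameters; node 0 is the constant node X_0 = 1.
theta = {(0, 1): -0.5, (0, 2): 0.2, (0, 3): 0.1, (1, 2): 1.0, (2, 3): -0.8}
p = 3

def unnormalized(x):
    """exp of the sum of theta_jk x_j x_k; Phi cancels in any conditional."""
    state = (1,) + tuple(x)
    return math.exp(sum(t * state[a] * state[b] for (a, b), t in theta.items()))

def conditional_from_joint(j, x):
    """Pr(X_j = 1 | X_-j = x_-j) computed directly from the joint (17.28)."""
    x1 = list(x); x1[j - 1] = 1
    x0 = list(x); x0[j - 1] = 0
    return unnormalized(x1) / (unnormalized(x1) + unnormalized(x0))

def conditional_logistic(j, x):
    """The logistic form (17.30), using only edges that touch node j."""
    state = (1,) + tuple(x)
    eta = 0.0
    for (a, b), t in theta.items():
        if a == j:
            eta += t * state[b]
        elif b == j:
            eta += t * state[a]
    return 1.0 / (1.0 + math.exp(-eta))

# Largest disagreement over every node and every configuration.
max_gap = max(abs(conditional_from_joint(j, x) - conditional_logistic(j, x))
              for j in range(1, p + 1)
              for x in itertools.product((0, 1), repeat=p))
```

The two computations agree up to rounding error because every term that does not involve $X_j$ cancels in the ratio.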


17.4.1 Estimation of the Parameters when the Graph
Structure is Known 


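Given some data from this model, how can we estimate the parameters?
Suppose we have observations $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \in \{0,1\}^p$,
$i = 1, \ldots, N$. The log-likelihood is

$$\ell(\Theta) = \sum_{i=1}^N \log \Pr_\Theta(X_i = x_i) = \sum_{i=1}^N \Big[ \sum_{(j,k) \in E} \theta_{jk} x_{ij} x_{ik} - \Phi(\Theta) \Big]. \qquad (17.31)$$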


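The gradient of the log-likelihood is

$$\frac{\partial \ell(\Theta)}{\partial \theta_{jk}} = \sum_{i=1}^N x_{ij} x_{ik} - N \frac{\partial \Phi(\Theta)}{\partial \theta_{jk}} \qquad (17.32)$$

and

$$\frac{\partial \Phi(\Theta)}{\partial \theta_{jk}} = \sum_{x \in \mathcal{X}} x_j x_k \cdot p(x, \Theta) = E_\Theta(X_j X_k). \qquad (17.33)$$

Setting the gradient to zero gives

$$\hat{E}(X_j X_k) - E_\Theta(X_j X_k) = 0, \qquad (17.34)$$

where we have defined

$$\hat{E}(X_j X_k) = \frac{1}{N} \sum_{i=1}^N x_{ij} x_{ik}, \qquad (17.35)$$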

the expectation taken with respect to the empirical distribution of the data. 
Looking at (17.34), we see that the maximum likelihood estimates simply 
match the estimated inner products between the nodes to their observed 
inner products. This is a standard form for the score (gradient) equation 
for exponential family models, in which sufficient statistics are set equal to 
their expectations under the model. 
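
The moment-matching condition (17.34) can be illustrated with plain gradient ascent on a tiny model, where the model expectations are obtained by brute-force enumeration. This is a sketch under stated assumptions rather than a recommended fitting routine: the data, step size, iteration count and function names are all hypothetical.

```python
import itertools
import math

def model_moments(theta, p):
    """E_Theta(X_j X_k) for every edge, by enumerating all 2^p states."""
    weights = {}
    for x in itertools.product((0, 1), repeat=p):
        s = (1,) + x                          # s[0] is the constant node
        weights[s] = math.exp(sum(t * s[j] * s[k] for (j, k), t in theta.items()))
    z = sum(weights.values())                 # partition function
    return {(j, k): sum(w * s[j] * s[k] for s, w in weights.items()) / z
            for (j, k) in theta}

def fit_ising(data, edges, p, step=0.5, n_iter=5000):
    """Gradient ascent on the log-likelihood (17.31) divided by N."""
    n = len(data)
    rows = [(1,) + tuple(x) for x in data]
    empirical = {(j, k): sum(r[j] * r[k] for r in rows) / n for (j, k) in edges}
    theta = {e: 0.0 for e in edges}
    for _ in range(n_iter):
        model = model_moments(theta, p)
        for e in edges:                       # gradient is (17.32) divided by N
            theta[e] += step * (empirical[e] - model[e])
    return theta, empirical

# Hypothetical toy data on p = 2 nodes (every cell observed, so the MLE is finite).
data = [(0, 0)] * 4 + [(0, 1)] * 2 + [(1, 0)] * 1 + [(1, 1)] * 3
edges = [(0, 1), (0, 2), (1, 2)]
theta_hat, empirical = fit_ising(data, edges, p=2)
fitted = model_moments(theta_hat, p=2)
```

At convergence the entries of `fitted` agree with those of `empirical`, which is the statement that the sufficient statistics equal their expectations under the fitted model.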

To find the maximum likelihood estimates, we can use gradient search
or Newton methods. However the computation of $E_\Theta(X_j X_k)$ involves
enumeration of $p(X, \Theta)$ over $2^{p-2}$ of the $|\mathcal{X}| = 2^p$ possible values
of X, and is not generally feasible for large p (e.g., larger than about 30).
For smaller p, a number of standard statistical approaches are available:

Poisson log-linear modeling, where we treat the problem as a large
regression problem (Exercise 17.12). The response vector y is the vector
of $2^p$ counts in each of the cells of the multiway tabulation of the
data$^4$. The predictor matrix Z has $2^p$ rows and up to $1 + p + p^2$ columns
that characterize each of the cells, although this number depends on the
sparsity of the graph. The computational cost is essentially that of a
regression problem of this size, which is $O(p^4 2^p)$ and is manageable
for p < 20. The Newton updates are typically computed by iteratively
reweighted least squares, and the number of steps is usually in the
single digits. See Agresti (2002) and McCullagh and Nelder (1989) for
details. Standard software (such as the R function glm) can be used
to fit this model.

Gradient descent requires at most $O(p^2 2^{p-2})$ computations to compute
the gradient, but may require many more gradient steps than the
second-order Newton methods. Nevertheless, it can handle slightly
larger problems with p < 30. These computations can be reduced
by exploiting the special clique structure in sparse graphs, using the
junction-tree algorithm. Details are not given here.

Iterative proportional fitting (IPF) performs cyclical coordinate descent on 
the gradient equations (17.34). At each step a parameter is updated 
so that its gradient equation is exactly zero. This is done in a cyclical 
fashion until all the gradients are zero. One complete cycle costs the 
same as a gradient evaluation, but may be more efficient. Jirousek and 
Preucil (1995) implement an efficient version of IPF, using junction 
trees. 
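
The cyclical update in the last item can be sketched for the binary parametrization used here. Because $X_j X_k$ takes only the values 0 and 1, raising $\theta_{jk}$ by $\delta$ multiplies the model odds of $X_j X_k = 1$ by $e^\delta$, so each gradient equation can be solved exactly for one parameter at a time. The code below is an illustrative assumption about how such a step might look, not the Jirousek and Preucil implementation: it recomputes each model moment by brute force (so one cycle costs more than a single gradient evaluation), and the target moments are hypothetical.

```python
import itertools
import math

def model_moment(theta, p, edge):
    """E_Theta(X_j X_k) for one edge, by enumerating all 2^p states."""
    j, k = edge
    num = den = 0.0
    for x in itertools.product((0, 1), repeat=p):
        s = (1,) + x                          # s[0] is the constant node X_0 = 1
        w = math.exp(sum(t * s[a] * s[b] for (a, b), t in theta.items()))
        den += w
        num += w * s[j] * s[k]
    return num / den

def ipf(empirical, p, n_cycles=2000):
    """Cycle over the edges, solving each equation (17.34) exactly in turn."""
    theta = {e: 0.0 for e in empirical}
    for _ in range(n_cycles):
        for e, target in empirical.items():
            m = model_moment(theta, p, e)
            # Raising theta_e by delta multiplies the odds of X_j X_k = 1
            # by exp(delta); choose delta so the new odds match the target.
            theta[e] += math.log(target * (1 - m) / (m * (1 - target)))
    return theta

# Hypothetical empirical moments for a 3-node chain 1 - 2 - 3
# (all strictly between 0 and 1, which the update requires).
empirical = {(0, 1): 0.4, (0, 2): 0.5, (0, 3): 0.6, (1, 2): 0.3, (2, 3): 0.35}
theta_hat = ipf(empirical, p=3)
gaps = [abs(model_moment(theta_hat, 3, e) - t) for e, t in empirical.items()]
```

After enough cycles every entry of `gaps` is close to zero, meaning all of the gradient equations hold simultaneously.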


4 Each of the cell counts is treated as an independent Poisson variable. We get the 
multinomial model corresponding to (17.28) by conditioning on the total count N (which 
is also Poisson under this framework).

When p is large (> 30) other approaches have been used to approximate
the gradient.

• The mean field approximation (Peterson and Anderson, 1987) estimates
$E_\Theta(X_j X_k)$ by $E_\Theta(X_j) E_\Theta(X_k)$, and replaces the input
variables by their means, leading to a set of nonlinear equations for the
parameters $\theta_{jk}$.

• To obtain near-exact solutions, Gibbs sampling (Section 8.6) is used
to approximate $E_\Theta(X_j X_k)$ by successively sampling from the
estimated model probabilities $\Pr_\Theta(X_j \mid X_{-j})$ (see e.g. Ripley (1996)).
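
A single-site Gibbs sampler for this purpose can be sketched as follows. Each node is redrawn from its logistic conditional (17.30) given the current values of the others, and the products $X_j X_k$ are averaged over the sweeps. The function name, burn-in length, number of sweeps and parameter values are assumptions chosen for illustration only.

```python
import math
import random

def gibbs_moments(theta, p, n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate E_Theta(X_j X_k) for each edge by single-site Gibbs sampling."""
    rng = random.Random(seed)
    # neighbors[j] lists (other node, theta) for every edge touching node j.
    neighbors = {j: [] for j in range(p + 1)}
    for (a, b), t in theta.items():
        neighbors[a].append((b, t))
        neighbors[b].append((a, t))
    state = [1] + [rng.randint(0, 1) for _ in range(p)]   # state[0] = X_0 = 1
    sums = {e: 0 for e in theta}
    for sweep in range(n_sweeps + burn_in):
        for j in range(1, p + 1):
            eta = sum(t * state[k] for k, t in neighbors[j])
            prob_one = 1.0 / (1.0 + math.exp(-eta))       # conditional (17.30)
            state[j] = 1 if rng.random() < prob_one else 0
        if sweep >= burn_in:
            for (a, b) in theta:
                sums[(a, b)] += state[a] * state[b]
    return {e: s / n_sweeps for e, s in sums.items()}

# Hypothetical 3-node chain; node 0 is the constant node.
theta = {(0, 1): -0.5, (0, 2): 0.2, (0, 3): 0.1, (1, 2): 1.0, (2, 3): -0.8}
estimates = gibbs_moments(theta, p=3)
```

The estimates are Monte Carlo averages, so they carry sampling error that shrinks as the number of sweeps grows; for small p they can be compared with exact enumeration.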

We have not discussed decomposable models , for which the maximum 
likelihood estimates can be found in closed form without any iteration 
whatsoever. These models arise, for example, in trees: special graphs with 
tree-structured topology. When computational tractability is a concern, 
trees represent a useful class of models and they sidestep the computational 
concerns raised in this section. For details, see for example Chapter 12 of 
Whittaker (1990). 

17.4.2 Hidden Nodes

We can increase the complexity of a discrete Markov network by including
latent or hidden nodes. Suppose that a subset of the variables $X_\mathcal{H}$ are
unobserved or “hidden”, and the remainder $X_\mathcal{V}$ are observed or “visible.”
Then the log-likelihood of the observed data is

$$\ell(\Theta) = \sum_{i=1}^N \log[\Pr_\Theta(X_\mathcal{V} = x_{i\mathcal{V}})] = \sum_{i=1}^N \Big[ \log \sum_{x_\mathcal{H} \in \mathcal{X}_\mathcal{H}} \exp\Big( \sum_{(j,k) \in E} \theta_{jk} x_{ij} x_{ik} - \Phi(\Theta) \Big) \Big]. \qquad (17.36)$$

The sum over $x_\mathcal{H}$ means that we are summing over all possible {0,1} values
for the hidden units. The gradient works out to be

$$\frac{d\ell(\Theta)}{d\theta_{jk}} = \hat{E}_\mathcal{V} E_\Theta(X_j X_k \mid X_\mathcal{V}) - E_\Theta(X_j X_k). \qquad (17.37)$$

The first term is an empirical average of $X_j X_k$ if both are visible; if one
or both are hidden, they are first imputed given the visible data, and then
averaged over the hidden variables. The second term is the unconditional
expectation of $X_j X_k$.

The inner expectation in the first term can be evaluated using basic rules 
of conditional expectation and properties of Bernoulli random variables. In 
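detail, for observation i

$$E_\Theta(X_j X_k \mid X_\mathcal{V} = x_{i\mathcal{V}}) = \begin{cases} x_{ij} x_{ik} & \text{if } j, k \in \mathcal{V} \\ x_{ij} \Pr_\Theta(X_k = 1 \mid X_\mathcal{V} = x_{i\mathcal{V}}) & \text{if } j \in \mathcal{V}, k \in \mathcal{H} \\ \Pr_\Theta(X_j = 1, X_k = 1 \mid X_\mathcal{V} = x_{i\mathcal{V}}) & \text{if } j, k \in \mathcal{H}. \end{cases} \qquad (17.38)$$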

Now two separate runs of Gibbs sampling are required; the first to estimate
$E_\Theta(X_j X_k)$ by sampling from the model as above, and the second to
estimate $E_\Theta(X_j X_k \mid X_\mathcal{V} = x_{i\mathcal{V}})$. In this latter run, the visible
units are fixed (“clamped”) at their observed values and only the hidden
variables are sampled. Gibbs sampling must be done for each observation in
the training set, at each stage of the gradient search. As a result this
procedure can be very slow, even for moderate-sized models. In Section 17.4.4
we consider further model restrictions to make these computations manageable.
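
When there are only a few hidden units, the clamped expectation in (17.38) can be computed exactly by summing over the hidden configurations instead of sampling. The sketch below does this for a hypothetical model with two visible nodes and one hidden node; the function name, node labels and parameter values are assumptions made for illustration.

```python
import itertools
import math

def clamped_moment(theta, visible, hidden, edge):
    """E_Theta(X_j X_k | X_V = x_V) by summing over all hidden configurations.

    visible maps each visible node to its observed 0/1 value; hidden is the
    list of hidden node labels. Node 0 is the constant node X_0 = 1.
    """
    j, k = edge
    num = den = 0.0
    for h in itertools.product((0, 1), repeat=len(hidden)):
        state = {0: 1, **visible, **dict(zip(hidden, h))}
        w = math.exp(sum(t * state[a] * state[b] for (a, b), t in theta.items()))
        den += w
        num += w * state[j] * state[k]
    return num / den

# Hypothetical model: nodes 1 and 2 are visible, node 3 is hidden.
theta = {(0, 1): 0.3, (0, 2): -0.2, (0, 3): 0.1, (1, 3): 1.2, (2, 3): -0.7}
visible = {1: 1, 2: 0}
both_visible = clamped_moment(theta, visible, [3], (1, 2))   # x_1 * x_2
one_hidden = clamped_moment(theta, visible, [3], (1, 3))     # x_1 * Pr(X_3 = 1 | x_V)
```

With the visible nodes clamped at $x_1 = 1$ and $x_2 = 0$, the first value is simply the product of the observed values, and the second reduces to the logistic probability that the hidden node equals one.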

17.4.3 Estimation of the Graph Structure

The use of a lasso penalty with binary pairwise Markov networks has been
suggested by Lee et al. (2007) and Wainwright et al. (2007). The first
authors investigate a conjugate gradient procedure for exact maximization of
a penalized log-likelihood. The bottleneck is the computation of
$E_\Theta(X_j X_k)$ in the gradient; exact computation via the junction tree
algorithm is manageable for sparse graphs but becomes unwieldy for dense
graphs.

The second authors propose an approximate solution, analogous to the
Meinshausen and Bühlmann (2006) approach for the Gaussian graphical
model. They fit an $L_1$-penalized logistic regression model to each node as
a function of the other nodes, and then symmetrize the edge parameter
estimates in some fashion. For example if $\tilde\theta_{jk}$ is the estimate of the
j-k edge parameter from the logistic model for outcome node j, the “min”
symmetrization sets $\hat\theta_{jk}$ to either $\tilde\theta_{jk}$ or $\tilde\theta_{kj}$, whichever is
smallest in absolute value. The “max” criterion is defined similarly. They
show that under certain conditions either approximation estimates the
nonzero edges correctly as the sample size goes to infinity. Hoefling and
Tibshirani (2008) extend the graphical lasso to discrete Markov networks,
obtaining a procedure which is somewhat faster than conjugate gradients,
but still must deal with computation of $E_\Theta(X_j X_k)$. They also compare
the exact and approximate solutions in an extensive simulation study and
find the “min” or “max” approximations are only slightly less accurate than
the exact procedure, both for estimating the nonzero edges and for
estimating the actual values of the edge parameters, and are much faster.
Furthermore, they can handle denser graphs because they never need to
compute the quantities $E_\Theta(X_j X_k)$.
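
The “min” and “max” symmetrization rules are simple enough to state in a few lines of code. In this sketch the node-wise estimates are assumed to be stored in a dictionary keyed by (outcome node, predictor node), with both directions present for every pair; the function name and the numbers are hypothetical.

```python
def symmetrize(theta_tilde, rule="min"):
    """Combine the two node-wise estimates of each edge into one value.

    theta_tilde[(j, k)] is the coefficient of node k in the penalized
    logistic regression whose outcome is node j.
    """
    if rule not in ("min", "max"):
        raise ValueError("rule must be 'min' or 'max'")
    pick = min if rule == "min" else max
    theta_hat = {}
    for (j, k), value in theta_tilde.items():
        if j < k:
            other = theta_tilde[(k, j)]
            # Keep whichever estimate is smaller (or larger) in absolute value.
            theta_hat[(j, k)] = pick(value, other, key=abs)
    return theta_hat

# Hypothetical node-wise estimates for three nodes.
theta_tilde = {(1, 2): 0.9, (2, 1): 0.4, (1, 3): 0.0, (3, 1): -0.3,
               (2, 3): -1.1, (3, 2): -0.6}
edges_min = symmetrize(theta_tilde, "min")
edges_max = symmetrize(theta_tilde, "max")
```

In this toy input the 1-3 edge is zero in one regression and nonzero in the other, so the “min” rule drops it while the “max” rule keeps it.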

Finally, we point out a key difference between the Gaussian and binary
models. In the Gaussian case, both $\Sigma$ and its inverse will often be of
interest, and the graphical lasso procedure delivers estimates for both of
these quantities. However, the approximation of Meinshausen and Bühlmann
(2006) for Gaussian graphical models, analogous to the Wainwright et al.
(2007) approximation for the binary case, only yields an estimate of
$\Sigma^{-1}$. In contrast, in the Markov model for binary data, $\Theta$ is the
object of interest, and its inverse is not of interest. The approximate
method of Wainwright et al. (2007) estimates $\Theta$ efficiently and hence is
an attractive solution for the binary problem.

FIGURE 17.6. A restricted Boltzmann machine (RBM) in which there are no
connections between nodes in the same layer. The visible units are subdivided to
allow the RBM to model the joint density of features $\mathcal{V}_1$ and their labels $\mathcal{V}_2$.


17.4.4 Restricted Boltzmann Machines

In this section we consider a particular architecture for graphical models
inspired by neural networks, where the units are organized in layers. A
restricted Boltzmann machine (RBM) consists of one layer of visible units
and one layer of hidden units with no connections within each layer. It
is much simpler to compute the conditional expectations (as in 17.37 and
17.38) if the connections between hidden units are removed$^5$. Figure 17.6
shows an example; the visible layer is divided into input variables $\mathcal{V}_1$ and
output variables $\mathcal{V}_2$, and there is a hidden layer $\mathcal{H}$. We denote such a
network by

$$\mathcal{V}_1 \leftrightarrow \mathcal{H} \leftrightarrow \mathcal{V}_2. \qquad (17.39)$$

For example, $\mathcal{V}_1$ could be the binary pixels of an image of a handwritten
digit, and $\mathcal{V}_2$ could have 10 units, one for each of the observed class labels
0-9.

The restricted form of this model simplifies the Gibbs sampling for
estimating the expectations in (17.37), since the variables in each layer are
independent of one another, given the variables in the other layers. Hence
they can be sampled together, using the conditional probabilities given by
expression (17.30).
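
Because the units within a layer are conditionally independent, an entire layer can be drawn in one pass. The following sketch alternates between the two layers of a small RBM with a single block of visible units. The weight layout `W[j][h]`, the bias vectors `b` and `c` (which stand in for the edges to the constant node) and all numerical values are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, W, c, rng):
    """Sample every hidden unit at once given the visible vector v."""
    probs = [sigmoid(c[h] + sum(W[j][h] * v[j] for j in range(len(v))))
             for h in range(len(c))]
    return [1 if rng.random() < q else 0 for q in probs], probs

def sample_visible(h, W, b, rng):
    """Sample every visible unit at once given the hidden vector h."""
    probs = [sigmoid(b[j] + sum(W[j][m] * h[m] for m in range(len(h))))
             for j in range(len(b))]
    return [1 if rng.random() < q else 0 for q in probs], probs

# Hypothetical RBM with 4 visible and 2 hidden units; W[j][h] is the weight
# on the edge between visible unit j and hidden unit h, b and c are the
# weights on the edges to the constant node.
rng = random.Random(0)
W = [[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9], [0.0, 0.3]]
b = [0.1, -0.2, 0.0, 0.4]
c = [-0.3, 0.6]
v = [1, 0, 1, 1]
for _ in range(100):                  # alternate between the two layers
    h = sample_hidden(v, W, c, rng)[0]
    v = sample_visible(h, W, b, rng)[0]
```

Each conditional probability is an instance of the logistic form (17.30), restricted to the edges that cross between the layers.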

The resulting model is less general than a Boltzmann machine, but is still 
useful; for example it can learn to extract interesting features from images. 


5 We thank Geoffrey Hinton for assistance in the preparation of the material on RBMs. 





By alternately sampling the variables in each layer of the RBM shown
in Figure 17.6, it is possible to generate samples from the joint density
model. If the $\mathcal{V}_1$ part of the visible layer is clamped at a particular feature
vector during the alternating sampling, it is possible to sample from the
distribution over labels given $\mathcal{V}_1$. Alternatively, classification of test
items can be achieved by comparing the unnormalized joint densities of each
label category with the observed features. We do not need to compute the
partition function as it is the same for all of these combinations.

As noted, the restricted Boltzmann machine has the same generic form
as a single hidden layer neural network (Section 11.3). The edges in the
latter model are directed, the hidden units are usually real-valued, and the
fitting criterion is different. The neural network minimizes the error
(cross-entropy) between the targets and their model predictions, conditional
on the input features. In contrast, the restricted Boltzmann machine
maximizes the log-likelihood for the joint distribution of all visible units,
that is, the features and targets. It can extract information from the input
features that is useful for predicting the labels, but, unlike supervised
learning methods, it may also use some of its hidden units to model structure
in the feature vectors that is not immediately relevant for predicting the
labels. These features may turn out to be useful, however, when combined with
features derived from other hidden layers.

Unfortunately, Gibbs sampling in a restricted Boltzmann machine can
be very slow, as it can take a long time to reach stationarity. As the
network weights get larger, the chain mixes more slowly and we need to run
more steps to get the unconditional estimates. Hinton (2002) noticed
empirically that learning still works well if we estimate the second
expectation in (17.37) by starting the Markov chain at the data and only
running for a few steps (instead of to convergence). He calls this
contrastive divergence: we sample $\mathcal{H}$ given $\mathcal{V}_1, \mathcal{V}_2$, then
$\mathcal{V}_1, \mathcal{V}_2$ given $\mathcal{H}$ and finally $\mathcal{H}$ given $\mathcal{V}_1, \mathcal{V}_2$
again. The idea is that when the parameters are far from the solution, it
may be wasteful to iterate the Gibbs sampler to stationarity, as just a single
iteration will reveal a good direction for moving the estimates.
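
A one-step contrastive divergence update can be sketched as follows for an RBM with a single block of binary visible units. The three sampling stages in the loop follow the order described above. Several details are assumptions of this sketch and not statements from the text: the use of hidden-unit probabilities rather than sampled states when accumulating the statistics, the learning rate, the initialization and the toy training patterns.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, c):
    return [sigmoid(c[h] + sum(W[j][h] * v[j] for j in range(len(v))))
            for h in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[j] + sum(W[j][m] * h[m] for m in range(len(h))))
            for j in range(len(b))]

def bernoulli(probs, rng):
    return [1 if rng.random() < q else 0 for q in probs]

def cd1_update(batch, W, b, c, rate, rng):
    """One contrastive-divergence step (a single Gibbs iteration per case)."""
    n_vis, n_hid = len(b), len(c)
    dW = [[0.0] * n_hid for _ in range(n_vis)]
    db = [0.0] * n_vis
    dc = [0.0] * n_hid
    for v0 in batch:
        p0 = hidden_probs(v0, W, c)                    # H given the clamped data
        h0 = bernoulli(p0, rng)
        v1 = bernoulli(visible_probs(h0, W, b), rng)   # V given H
        p1 = hidden_probs(v1, W, c)                    # H given V again
        for j in range(n_vis):
            db[j] += v0[j] - v1[j]
            for h in range(n_hid):
                # data-driven term minus the short-chain model term
                dW[j][h] += v0[j] * p0[h] - v1[j] * p1[h]
        for h in range(n_hid):
            dc[h] += p0[h] - p1[h]
    scale = rate / len(batch)
    for j in range(n_vis):
        b[j] += scale * db[j]
        for h in range(n_hid):
            W[j][h] += scale * dW[j][h]
    for h in range(n_hid):
        c[h] += scale * dc[h]

# Hypothetical toy problem: two complementary 4-pixel patterns.
rng = random.Random(0)
batch = [[1, 1, 0, 0], [0, 0, 1, 1]] * 10
W = [[rng.gauss(0, 0.01) for _ in range(2)] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2
for _ in range(200):
    cd1_update(batch, W, b, c, rate=0.1, rng=rng)
```

The update replaces the unconditional expectation in (17.37) with statistics from a chain that has run for only one iteration, which is why it is cheap but only approximates the likelihood gradient.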

We now give an example to illustrate the use of an RBM. Using
contrastive divergence, it is possible to train an RBM to recognize
hand-written digits from the MNIST dataset (LeCun et al., 1998). With 2000
hidden units, 784 visible units for representing binary pixel intensities and
one 10-way multinomial visible unit for representing labels, the RBM achieves
an error rate of 1.9% on the test set. This is a little higher than the 1.4%
achieved by a support vector machine and comparable to the error rate
achieved by a neural network trained with backpropagation. The error rate
of the RBM, however, can be reduced to 1.25% by replacing the 784 pixel
intensities by 500 features that are produced from the images without using
any label information. First, an RBM with 784 visible units and 500 hidden
units is trained, using contrastive divergence, to model the set of images.
Then the hidden states of the first RBM are used as data for training a
second RBM that has 500 visible units and 500 hidden units. Finally, the
hidden states of the second RBM are used as the features for training an
RBM with 2000 hidden units as a joint density model. The details and
justification for learning features in this greedy, layer-by-layer way are
described in Hinton et al. (2006). Figure 17.7 gives a representation of the
composite model that is learned in this way and also shows some examples
of the types of distortion that it can cope with.

FIGURE 17.7. Example of a restricted Boltzmann machine for handwritten
digit classification. The network is depicted in the schematic on the left. Displayed
on the right are some difficult test images that the model classifies correctly.


Bibliographic Notes 

Much work has been done in defining and understanding the structure of 
graphical models. Comprehensive treatments of graphical models can be 
found in Whittaker (1990), Lauritzen (1996), Cox and Wermuth (1996), 
Edwards (2000), Pearl (2000), Anderson (2003), Jordan (2004), and Koller
and Friedman (2007). Wasserman (2004) gives a brief introduction, and 
Chapter 8 of Bishop (2006) gives a more detailed overview. Boltzmann 
machines were proposed in Ackley et al. (1985). Ripley (1996) has a detailed 
chapter on topics in graphical models that relate to machine learning. We 
found this particularly useful for its discussion of Boltzmann machines. 


Exercises 


Ex. 17.1 For the Markov graph of Figure 17.8, list all of the implied
conditional independence relations and find the maximal cliques.

FIGURE 17.8.


Ex. 17.2 Consider random variables $X_1, X_2, X_3, X_4$. In each of the following
cases draw a graph that has the given independence relations:

(a) $X_1 \perp X_3 \mid X_2$ and $X_2 \perp X_4 \mid X_3$.

(b) $X_1 \perp X_4 \mid X_2, X_3$ and $X_2 \perp X_4 \mid X_1, X_3$.

(c) $X_1 \perp X_4 \mid X_2, X_3$, $X_1 \perp X_3 \mid X_2, X_4$ and $X_3 \perp X_4 \mid X_1, X_2$.

Ex. 17.3 Let $\Sigma$ be the covariance matrix of a set of p variables X. Consider
the partial covariance matrix $\Sigma_{a.b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}$ between
the two subsets of variables $X_a = (X_1, X_2)$ consisting of the first two, and
$X_b$ the rest. This is the covariance matrix between these two variables, after
linear adjustment for all the rest. In the Gaussian distribution, this is the
covariance matrix of the conditional distribution of $X_a \mid X_b$. The partial
correlation coefficient $\rho_{jk|\text{rest}}$ between the pair $X_a$ conditional on the
rest $X_b$, is simply computed from this partial covariance. Define
$\Theta = \Sigma^{-1}$.

1. Show that $\Sigma_{a.b} = \Theta_{aa}^{-1}$.

2. Show that if any off-diagonal element of $\Theta$ is zero, then the partial
correlation coefficient between the corresponding variables is zero.

3. Show that if we treat $\Theta$ as if it were a covariance matrix, and compute
the corresponding “correlation” matrix

$$R = \text{diag}(\Theta)^{-1/2} \cdot \Theta \cdot \text{diag}(\Theta)^{-1/2}, \qquad (17.40)$$

then $r_{jk} = -\rho_{jk|\text{rest}}$.

Ex. 17.4 Denote by

$$f(X_1 \mid X_2, X_3, \ldots, X_p)$$

the conditional density of $X_1$ given $X_2, \ldots, X_p$. If

$$f(X_1 \mid X_2, X_3, \ldots, X_p) = f(X_1 \mid X_3, \ldots, X_p),$$

show that $X_1 \perp X_2 \mid X_3, \ldots, X_p$.

Ex. 17.5 Consider the setup in Section 17.3.1 with no missing edges. Show
that

$$S_{11} \beta - s_{12} = 0$$

are the estimating equations for the multiple regression coefficients of the
last variable on the rest.

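Ex. 17.6 Recovery of $\hat\Theta = \hat\Sigma^{-1}$ from Algorithm 17.1. Use expression
(17.16) to derive the standard partitioned inverse expressions

$$\theta_{12} = -W_{11}^{-1} w_{12} \theta_{22} \qquad (17.41)$$

$$\theta_{22} = 1/(w_{22} - w_{12}^T W_{11}^{-1} w_{12}). \qquad (17.42)$$

Since $\hat\beta = W_{11}^{-1} w_{12}$, show that $\hat\theta_{22} = 1/(w_{22} - w_{12}^T \hat\beta)$ and
$\hat\theta_{12} = -\hat\beta \hat\theta_{22}$. Thus $\hat\theta_{12}$ is simply a rescaling of $\hat\beta$ by
$-\hat\theta_{22}$.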

Ex. 17.7 Write a program to implement the modified regression procedure 
(17.1) for fitting the Gaussian graphical model with pre-specified edges 
missing. Test it on the flow cytometry data from the book website, using 
the graph of Figure 17.1. 

Ex. 17.8 


(a) Write a program to fit the lasso using the coordinate descent procedure 
(17.26). Compare its results to those from the lars program or some 
other convex optimizer, to check that it is working correctly. 


(b) Using the program from (a), write code to implement the graphical 
lasso algorithm (17.2). Apply it to the flow cytometry data from the 
book website. Vary the regularization parameter and examine the 
resulting networks. 


Ex. 17.9 Suppose that we have a Gaussian graphical model in which some 
or all of the data at some vertices are missing. 


(a) Consider the EM algorithm for a dataset of N i.i.d. multivariate
observations $x_i \in \mathbb{R}^p$ with mean $\mu$ and covariance matrix $\Sigma$. For each
sample i, let $o_i$ and $m_i$ index the predictors that are observed and
missing, respectively. Show that in the E step, the observations are
imputed from the current estimates of $\mu$ and $\Sigma$:

$$\hat{x}_{i,m_i} = E(x_{i,m_i} \mid x_{i,o_i}, \theta) = \hat\mu_{m_i} + \hat\Sigma_{m_i,o_i} \hat\Sigma_{o_i,o_i}^{-1} (x_{i,o_i} - \hat\mu_{o_i}) \qquad (17.43)$$

while in the M step, $\mu$ and $\Sigma$ are re-estimated from the empirical
mean and (modified) covariance of the imputed data:

$$\hat\mu_j = \sum_{i=1}^N \hat{x}_{ij} / N$$

$$\hat\Sigma_{jj'} = \sum_{i=1}^N [(\hat{x}_{ij} - \hat\mu_j)(\hat{x}_{ij'} - \hat\mu_{j'}) + c_{i,jj'}] / N \qquad (17.44)$$

where $c_{i,jj'} = \hat\Sigma_{jj'}$ if $j, j' \in m_i$ and zero otherwise. Explain the reason
for the correction term (Little and Rubin, 2002).

(b) Implement the EM algorithm for the Gaussian graphical model using 

the modified regression procedure from Exercise 17.7 for the M-step. 

(c) For the flow cytometry data on the book website, set the data for the 

last protein Jnk in the first 1000 observations to missing, fit the model 
of Figure 17.1, and compare the predicted values to the actual values 
for Jnk. Compare the results to those obtained from a regression of 
Jnk on the other vertices with edges to Jnk in Figure 17.1, using only 
the non-missing data. 

Ex. 17.10 Using a simple binary graphical model with just two variables,
show why it is essential to include a constant node $X_0 = 1$ in the model.

Ex. 17.11 Show that the Ising model (17.28) for the joint probabilities in 
a discrete graphical model implies that the conditional distributions have 
the logistic form (17.30). 

Ex. 17.12 Consider a Poisson regression problem with p binary variables
$x_{ij}$, $j = 1, \ldots, p$ and response variable $y_i$ which measures the number of
observations with predictor $x_i \in \{0,1\}^p$. The design is balanced, in that all
$n = 2^p$ possible combinations are measured. We assume a log-linear model
for the Poisson mean in each cell

$$\log \mu(X) = \theta_{00} + \sum_{(j,k) \in E} x_{ij} x_{ik} \theta_{jk}, \qquad (17.45)$$

using the same notation as in Section 17.4.1 (including the constant variable
$x_{i0} = 1 \; \forall i$). We assume the response is distributed as

$$\Pr(Y = y \mid X = x) = \frac{e^{-\mu(x)} \mu(x)^y}{y!}. \qquad (17.46)$$

Write down the conditional log-likelihood for the observed responses $y_i$,
and compute the gradient.

(a) Show that the gradient equation for $\theta_{00}$ computes the partition
function (17.29).

(b) Show that the gradient equations for the remainder of the parameters
are equivalent to the gradient (17.34).


High-Dimensional Problems: $p \gg N$


18.1 When p is Much Bigger than N 


In this chapter we discuss prediction problems in which the number of
features p is much larger than the number of observations N, often written
$p \gg N$. Such problems have become of increasing importance, especially in
genomics and other areas of computational biology. We will see that high
variance and overfitting are a major concern in this setting. As a result,
simple, highly regularized approaches often become the methods of choice.
The first part of the chapter focuses on prediction in both the classification
and regression settings, while the second part discusses the more basic
problem of feature selection and assessment.

To get us started, Figure 18.1 summarizes a small simulation study that
demonstrates the “less fitting is better” principle that applies when $p \gg N$.
For each of N = 100 samples, we generated p standard Gaussian features
X with pairwise correlation 0.2. The outcome Y was generated according
to a linear model

$$Y = \sum_{j=1}^p X_j \beta_j + \sigma \varepsilon, \qquad (18.1)$$

where $\varepsilon$ was generated from a standard Gaussian distribution. For each
dataset, the coefficients $\beta_j$ were also generated from a standard Gaussian
distribution. We investigated three cases: p = 20, 100, and 1000. The
standard deviation $\sigma$ was chosen in each case so that the signal-to-noise
ratio $\text{Var}[E(Y \mid X)]/\sigma^2$ equaled 2. As a result, the number of significant
univariate regression coefficients$^1$ was 9, 33 and 331, respectively, averaged
over the 100 simulation runs. The p = 1000 case is designed to mimic the
kind of data that we might see in a high-dimensional genomic or proteomic
dataset, for example.

FIGURE 18.1. Test-error results for simulation experiments. Shown are boxplots
of the relative test errors over 100 simulations, for three different values
of p, the number of features (panels: 20 features, 100 features, 1000 features;
horizontal axis: effective degrees of freedom). The relative error is the test
error divided by the Bayes error, $\sigma^2$. From left to right, results are shown
for ridge regression with three different values of the regularization parameter
$\lambda$: 0.001, 100 and 1000. The (average) effective degrees of freedom in the fit
is indicated below each plot.

We fit a ridge regression to the data, with three different values for the
regularization parameter $\lambda$: 0.001, 100, and 1000. When $\lambda = 0.001$, this
is nearly the same as least squares regression, with a little regularization
just to ensure that the problem is non-singular when p > N. Figure 18.1
shows boxplots of the relative test error achieved by the different estimators
in each scenario. The corresponding average degrees of freedom used in
each ridge-regression fit is indicated (computed using formula (3.50) on
page 68$^2$). The degrees of freedom is a more interpretable parameter than
$\lambda$. We see that ridge regression with $\lambda = 0.001$ (20 df) wins when p = 20;
$\lambda = 100$ (35 df) wins when p = 100, and $\lambda = 1000$ (43 df) wins when
p = 1000.
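
The effective degrees of freedom of a ridge fit is $\text{df}(\lambda) = \sum_j d_j^2 / (d_j^2 + \lambda)$, where the $d_j$ are the singular values of the centered predictor matrix. A minimal sketch of that calculation follows; the singular values used here are hypothetical and are not those of the simulation above.

```python
def ridge_df(singular_values, lam):
    """Effective degrees of freedom of ridge regression, formula (3.50)."""
    return sum(d * d / (d * d + lam) for d in singular_values)

# Hypothetical singular values of a centered predictor matrix.
d = [12.0, 9.0, 5.0, 2.0, 0.5]
df_by_lambda = {lam: ridge_df(d, lam) for lam in (0.001, 100.0, 1000.0)}
```

With $\lambda$ near zero the result approaches the number of nonzero singular values, and it decreases steadily as $\lambda$ grows, which is why the degrees of freedom give a more interpretable scale than $\lambda$ itself.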

Here is an explanation for these results. When p = 20, we fit all the way
and we can identify as many of the significant coefficients as possible with
low bias. When p = 100, we can identify some non-zero coefficients using
moderate shrinkage. Finally, when p = 1000, even though there are many
nonzero coefficients, we don’t have a hope for finding them and we need
to shrink all the way down. As evidence of this, let $t_j = \hat\beta_j / \widehat{se}_j$, where
$\hat\beta_j$ is the ridge regression estimate and $\widehat{se}_j$ its estimated standard error.
Then using the optimal ridge parameter in each of the three cases, the median
value of $|t_j|$ was 2.0, 0.6 and 0.2, and the average number of $|t_j|$ values
exceeding 2 was equal to 9.8, 1.2 and 0.0.


1 We call a regression coefficient significant if $|\hat\beta_j / \widehat{se}_j| > 2$, where $\hat\beta_j$
is the estimated (univariate) coefficient and $\widehat{se}_j$ is its estimated standard
error.

2 For a fixed value of the regularization parameter $\lambda$, the degrees of freedom
depends on the observed predictor values in each simulation. Hence we compute
the average degrees of freedom over simulations.

Ridge regression with $\lambda = 0.001$ successfully exploits the correlation in
the features when p < N, but cannot do so when $p \gg N$. In the latter case
there is not enough information in the relatively small number of samples
to efficiently estimate the high-dimensional covariance matrix. In that case,
more regularization leads to superior prediction performance.

Thus it is not surprising that the analysis of high-dimensional data
requires either modification of procedures designed for the N > p scenario, or
entirely new procedures. In this chapter we discuss examples of both kinds
of approaches for high-dimensional classification and regression; these
methods tend to regularize quite heavily, using scientific contextual knowledge
to suggest the appropriate form for this regularization. The chapter ends
with a discussion of feature selection and multiple testing.


18.2 Diagonal Linear Discriminant Analysis and 
Nearest Shrunken Centroids 

Gene expression arrays are an important new technology in biology, and
are discussed in Chapters 1 and 14. The data in our next example form
a matrix of 2308 genes (columns) and 63 samples (rows), from a set of
microarray experiments. Each expression value is a log-ratio log(R/G). R
is the amount of gene-specific RNA in the target sample that hybridizes
to a particular (gene-specific) spot on the microarray, and G is the
corresponding amount of RNA from a reference sample. The samples arose from
small, round blue-cell tumors (SRBCT) found in children, and are classified
into four major types: BL (Burkitt lymphoma), EWS (Ewing’s sarcoma),
NB (neuroblastoma), and RMS (rhabdomyosarcoma). There is an additional
test data set of 20 observations. We will not go into the scientific
background here.

Since $p \gg N$, we cannot fit a full linear discriminant analysis (LDA) to
the data; some sort of regularization is needed. The method we describe
here is similar to the methods of Section 4.3.1, but with important
modifications that achieve feature selection. The simplest form of
regularization assumes that the features are independent within each class,
that is, the within-class covariance matrix is diagonal. Despite the fact
that features will rarely be independent within a class, when $p \gg N$ we
don’t have enough data to estimate their dependencies. The assumption of
independence greatly reduces the number of parameters in the model and often
results in an effective and interpretable classifier.

Thus we consider the diagonal-covariance LDA rule for classifying the
classes. The discriminant score [see (4.12) on page 110] for class k is

$$\delta_k(x^*) = -\sum_{j=1}^p \frac{(x_j^* - \bar{x}_{kj})^2}{s_j^2} + 2 \log \pi_k. \qquad (18.2)$$

Here $x^* = (x_1^*, x_2^*, \ldots, x_p^*)^T$ is a vector of expression values for a test
observation, $s_j$ is the pooled within-class standard deviation of the jth gene,
and $\bar{x}_{kj} = \sum_{i \in C_k} x_{ij} / N_k$ is the mean of the $N_k$ values for gene j in
class k, with $C_k$ being the index set for class k. We call
$\bar{x}_k = (\bar{x}_{k1}, \bar{x}_{k2}, \ldots, \bar{x}_{kp})^T$ the centroid of class k. The first part of
(18.2) is simply the (negative) standardized squared distance of $x^*$ to the
kth centroid. The second part is a correction based on the class prior
probability $\pi_k$, where $\sum_{k=1}^K \pi_k = 1$. The classification rule is then

$$C(x^*) = \ell \quad \text{if } \delta_\ell(x^*) = \max_k \delta_k(x^*). \qquad (18.3)$$

We see that the diagonal LDA classifier is equivalent to a nearest centroid 
classifier after appropriate standardization. It is also a special case of the 
naive-Bayes classifier, as described in Section 6.6.3. It assumes that the 
features in each class have independent Gaussian distributions with the 
same variance. 
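
The score (18.2) and the rule (18.3) translate directly into code. The sketch below uses a hypothetical two-class, three-gene example; the class labels, centroids, standard deviations and priors are invented for illustration.

```python
import math

def diagonal_lda_scores(x_star, centroids, s, priors):
    """Discriminant scores (18.2), one per class."""
    scores = {}
    for k, centroid in centroids.items():
        dist = sum(((xj - cj) / sj) ** 2
                   for xj, cj, sj in zip(x_star, centroid, s))
        scores[k] = -dist + 2.0 * math.log(priors[k])
    return scores

def classify(x_star, centroids, s, priors):
    """Rule (18.3): pick the class with the largest score."""
    scores = diagonal_lda_scores(x_star, centroids, s, priors)
    return max(scores, key=scores.get)

# Hypothetical two-class, three-gene example.
centroids = {"A": [1.0, 0.0, -1.0], "B": [-1.0, 0.5, 1.0]}
s = [0.5, 1.0, 2.0]                 # pooled within-class standard deviations
priors = {"A": 0.5, "B": 0.5}
label = classify([0.8, 0.1, -0.5], centroids, s, priors)
```

With equal priors the rule reduces to choosing the centroid that is nearest in standardized distance, which here is class "A".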

The diagonal LDA classifier is often effective in high-dimensional
settings. It is also called the “independence rule” in Bickel and Levina
(2004), who demonstrate theoretically that it will often outperform standard
linear discriminant analysis in high-dimensional problems. Here the diagonal
LDA classifier yielded five misclassification errors for the 20 test samples.
One drawback of the diagonal LDA classifier is that it uses all of the
features (genes), and hence is not convenient for interpretation. With further
regularization we can do better, both in terms of test error and
interpretability.

We would like to regularize in a way that automatically drops out
features that are not contributing to the class predictions. We can do this
by shrinking the classwise mean toward the overall mean, for each feature
separately. The result is a regularized version of the nearest centroid
classifier, or equivalently a regularized version of the diagonal-covariance
form of LDA. We call the procedure nearest shrunken centroids (NSC).

The shrinkage procedure is defined as follows. Let 


dkj — 


•Kkj -Kj 
m k (sj + s 0 )’ 


(18.4) 


where Xj is the overall mean for gene j, mj: = 1 /Nk — 1/N and So is a 
small positive constant, typically chosen to be the median of the Sj values. 
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FIGURE 18.2. Soft thresholding function sign(a;)(|a;| — A)+ is shown in orange, 
along with the 45° line in red. 

This constant guards against large d k j values that arise from expression 
values near zero. With constant within-class variance a 2 , the variance of 
the contrast x k j — Xj in the numerator is ni^a 2 , and hence the form of the 
standardization in the denominator. We shrink the d k j toward zero using 
soft thresholding 

d' kj = sign(4 i )(|d fej -| - A)+; (18.5) 

see Figure 18.2. Here $\Delta$ is a parameter to be determined; we used
10-fold cross-validation in the example (see the top panel of Figure 18.4).
Each $d_{kj}$ is reduced by an amount $\Delta$ in absolute value, and is set
to zero if the reduced absolute value would be negative. The
soft-thresholding function is shown in Figure 18.2; the same thresholding
is applied to wavelet coefficients in Section 5.9. An alternative is to use
hard thresholding

$$d'_{kj} = d_{kj} \cdot I(|d_{kj}| > \Delta); \tag{18.6}$$

we prefer soft-thresholding, as it is a smoother operation and typically
works better. The shrunken versions of $\bar{x}_{kj}$ are then obtained by
reversing the transformation in (18.4):

$$\bar{x}'_{kj} = \bar{x}_j + m_k (s_j + s_0) d'_{kj}. \tag{18.7}$$

We then use the shrunken centroids $\bar{x}'_{kj}$ in place of the original
$\bar{x}_{kj}$ in the discriminant score (18.2). The estimator (18.5) can
also be viewed as a lasso-style estimator for the class means
(Exercise 18.2).
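
The steps (18.4), (18.5) and (18.7) can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: $s_j$ is
taken to be the pooled within-class standard deviation of feature $j$ (its
definition lies outside this passage), and the function and argument names
are hypothetical rather than taken from any published implementation.

```python
import math
from statistics import median


def soft_threshold(d, delta):
    """sign(d) * (|d| - delta)_+, as in (18.5)."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def hard_threshold(d, delta):
    """d * I(|d| > delta), as in (18.6)."""
    return d if abs(d) > delta else 0.0


def shrunken_centroids(X, y, delta):
    """Return (shrunken centroid per class, sorted indices of retained features).

    X: list of N samples, each a list of p floats; y: list of N class labels.
    Assumes at least two classes and features with nonzero spread.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    members = {k: [row for row, g in zip(X, y) if g == k] for k in classes}
    centroid = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
                for k, rows in members.items()}
    # Assumption: s_j is the pooled within-class standard deviation.
    s = []
    for j in range(p):
        ss = sum((row[j] - centroid[g][j]) ** 2 for row, g in zip(X, y))
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)
    shrunk, kept = {}, set()
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) - 1.0 / n)
        row = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (centroid[k][j] - overall[j]) / scale    # (18.4)
            d_shrunk = soft_threshold(d, delta)          # (18.5)
            if d_shrunk != 0.0:
                kept.add(j)
            row.append(overall[j] + scale * d_shrunk)    # (18.7)
        shrunk[k] = row
    return shrunk, sorted(kept)
```

A feature is retained only if its shrunken contrast is nonzero for at least
one class, so raising `delta` reduces the number of retained features.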

Notice that only the genes that have a nonzero $d'_{kj}$ for at least one
of the classes play a role in the classification rule, and hence the vast
majority of genes can often be discarded. In this example, all but 43 genes
were discarded, leaving a small interpretable set of genes that
characterize each class. Figure 18.3 represents the genes in a heatmap.

Figure 18.4 (top panel) demonstrates the effectiveness of the shrinkage.
With no shrinkage we make 5/20 errors on the test data, and several errors
on the training and CV data. The shrunken centroids achieve zero test
errors for a fairly broad band of values for $\Delta$. The bottom panel of
Figure 18.4 shows the four centroids for the SRBCT data (gray), relative to
the overall centroid. The blue bars are shrunken versions of these
centroids, obtained by soft-thresholding the gray bars, using
$\Delta = 4.3$. The discriminant scores (18.2) can be used to construct
class probability estimates:

$$\hat{p}_k(x^*) = \frac{e^{\frac{1}{2}\delta_k(x^*)}}{\sum_{\ell=1}^{K} e^{\frac{1}{2}\delta_\ell(x^*)}}. \tag{18.8}$$

These can be used to rate the classifications, or to decide not to classify
a particular sample at all.
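
As a minimal sketch of how such probabilities might be used, the snippet
below turns a list of discriminant scores into probabilities proportional
to $\exp(\delta_k/2)$ and declines to classify when the largest probability
is small. The cutoff `min_prob` and both function names are hypothetical;
the text does not prescribe a particular abstention rule.

```python
import math


def class_probabilities(delta):
    """Probabilities proportional to exp(delta_k / 2) for scores delta_k."""
    top = max(delta)  # subtract the maximum for numerical stability
    weights = [math.exp(0.5 * (d - top)) for d in delta]
    total = sum(weights)
    return [w / total for w in weights]


def classify_or_abstain(delta, min_prob=0.9):
    """Index of the most probable class, or None if it is below min_prob."""
    probs = class_probabilities(delta)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= min_prob else None
```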

Note that other forms of feature selection can be used in this setting, 
including hard thresholding. Fan and Fan (2008) show theoretically the 
importance of carrying out some kind of feature selection with diagonal 
linear discriminant analysis in high-dimensional problems. 


18.3 Linear Classifiers with Quadratic 
Regularization 

Ramaswamy et al. (2001) present a more difficult microarray classification 
problem, involving a training set of 144 patients with 14 different types of 
cancer, and a test set of 54 patients. Gene expression measurements were 
available for 16,063 genes. 

Table 18.1 shows the prediction results from eight different
classification methods. The data from each patient was first standardized
to have mean 0 and variance 1; this seems to improve prediction accuracy
overall in this example, suggesting that the “shape” of each
gene-expression profile is important, rather than the absolute expression
levels. In each case, the regularization parameter has been chosen to
minimize the cross-validation error, and the test error at that value of
the parameter is shown. When more than one value of the regularization
parameter yields the minimal cross-validation error, the average test
error at these values is reported.

[Figure: heat-map with panel labels BL, EWS, NB, RMS.]

FIGURE 18.3. Heat-map of the chosen 43 genes. Within each of the horizontal
partitions, we have ordered the genes by hierarchical clustering, and
similarly for the samples within each vertical partition. Yellow represents
over- and blue under-expression.

[Figure: two panels; axes labeled Number of Genes (2308 down to 1) and
Amount of Shrinkage $\Delta$.]

FIGURE 18.4. (Top): Error curves for the SRBCT data. Shown are the
training, 10-fold cross-validation, and test misclassification errors as
the threshold parameter $\Delta$ is varied. The value $\Delta = 4.34$ is
chosen by CV, resulting in a subset of 43 selected genes. (Bottom): Four
centroid profiles $d_{kj}$ for the SRBCT data (gray), relative to the
overall centroid. Each centroid has 2308 components, and we see
considerable noise. The blue bars are shrunken versions $d'_{kj}$ of these
centroids, obtained by soft-thresholding the gray bars, using
$\Delta = 4.3$.

TABLE 18.1. Prediction results for microarray data with 14 cancer classes.
Method 1 is described in Section 18.2. Methods 2, 3 and 6 are discussed in
Section 18.3, while 4, 7 and 8 are discussed in Section 18.4. Method 5 is
described in Section 13.3. The elastic-net penalized multinomial does the
best on the test data, but the standard error of each test-error estimate
is about 3, so such comparisons are inconclusive.

Methods                                    CV errors (SE)   Test errors   Number of
                                           Out of 144       Out of 54     Genes Used
1. Nearest shrunken centroids              35 (5.0)         17            6,520
2. $L_2$-penalized discriminant analysis   25 (4.1)         12            16,063
3. Support vector classifier               26 (4.2)         14            16,063
4. Lasso regression (one vs all)           30.7 (1.8)       12.5          1,429
5. $k$-nearest neighbors                   41 (4.6)         26            16,063
6. $L_2$-penalized multinomial             26 (4.2)         15            16,063
7. $L_1$-penalized multinomial             17 (2.8)         13            269
8. Elastic-net penalized multinomial       22 (3.7)         11.8          384

RDA (regularized discriminant analysis), regularized multinomial logistic
regression, and the support vector machine are more complex methods that
try to exploit multivariate information in the data. We describe each in
turn, as well as a variety of regularization methods, including both $L_1$
and $L_2$ and some in between.

18.3.1 Regularized Discriminant Analysis 

Regularized discriminant analysis (RDA) is described in Section 4.3.1.
Linear discriminant analysis involves the inversion of a $p \times p$
within-covariance matrix. When $p \gg N$, this matrix can be huge, has rank
at most $N < p$, and hence is singular. RDA overcomes the singularity
issues by regularizing the within-covariance estimate $\hat{\Sigma}$. Here
we use a version of RDA that shrinks $\hat{\Sigma}$ towards its diagonal:

$$\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma)\,\mathrm{diag}(\hat{\Sigma}), \quad \text{with } \gamma \in [0, 1]. \tag{18.9}$$

Note that $\gamma = 0$ corresponds to diagonal LDA, which is the “no
shrinkage” version of nearest shrunken centroids. The form of shrinkage in
(18.9) is much like ridge regression (Section 3.4.1), which shrinks the
total covariance matrix of the features towards a diagonal (scalar) matrix.
In fact, viewing linear discriminant analysis as linear regression with
optimal scoring of the categorical response [see (12.58) in Section 12.6],
the equivalence becomes more precise.

The computational burden of inverting this large $p \times p$ matrix is
overcome using the methods discussed in Section 18.3.5. The value of
$\gamma$ was chosen by cross-validation in line 2 of Table 18.1; all values
of $\gamma \in (0.002, 0.550)$ gave the same CV and test error. Further
development of RDA, including shrinkage of the centroids in addition to the
covariance matrix, can be found in Guo et al. (2006).
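
The shrinkage in (18.9) leaves the diagonal of the covariance estimate
untouched and scales every off-diagonal entry by $\gamma$. A small sketch,
with a hypothetical function name and the matrix held as nested lists:

```python
def shrink_to_diagonal(sigma, gamma):
    """Return gamma * sigma + (1 - gamma) * diag(sigma) for a square matrix."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    p = len(sigma)
    # Diagonal entries are unchanged; off-diagonal entries are scaled by gamma.
    return [[sigma[i][j] if i == j else gamma * sigma[i][j] for j in range(p)]
            for i in range(p)]


# A rank-one (singular) 2 x 2 estimate becomes invertible once gamma < 1.
singular = [[1.0, 1.0], [1.0, 1.0]]
regularized = shrink_to_diagonal(singular, 0.5)
determinant = (regularized[0][0] * regularized[1][1]
               - regularized[0][1] * regularized[1][0])
```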


18.3.2 Logistic Regression with Quadratic Regularization 

Logistic regression (Section 4.4) can be modified in a similar way, to deal
with the $p \gg N$ case. With $K$ classes, we use a symmetric version of the
multiclass logistic model (4.17) on page 119:

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + x^T \beta_k)}{\sum_{\ell=1}^{K} \exp(\beta_{\ell 0} + x^T \beta_\ell)}. \tag{18.10}$$

This has $K$ coefficient vectors of log-odds parameters
$\beta_1, \beta_2, \ldots, \beta_K$. We regularize the fitting by
maximizing the penalized log-likelihood

$$\max_{\{\beta_{k0}, \beta_k\}_1^K} \left[ \sum_{i=1}^{N} \log \Pr(g_i \mid x_i) - \frac{\lambda}{2} \sum_{k=1}^{K} \|\beta_k\|_2^2 \right]. \tag{18.11}$$

This regularization automatically resolves the redundancy in the
parametrization, and forces $\sum_{k=1}^{K} \hat{\beta}_{kj} = 0$,
$j = 1, \ldots, p$ (Exercise 18.3). Note that the constant terms
$\beta_{k0}$ are not regularized (and so one should be set to zero). The
resulting optimization problem is convex, and can be solved by a Newton
algorithm or other numerical techniques. Details are given in Zhu and
Hastie (2004). Friedman et al. (2010) provide software for computing the
regularization path for the two- and multiclass logistic regression models.
Table 18.1, line 6 reports the results for the multiclass logistic
regression model, referred to there as “multinomial”. It can be shown
(Rosset et al., 2004a) that for separable data, as $\lambda \to 0$, the
regularized (two-class) logistic regression estimate (renormalized)
converges to the maximal margin classifier (Section 12.2). This gives an
attractive alternative to the support-vector machine, discussed next,
especially in the multiclass case.
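
The sum-to-zero property of the penalized fit can be observed numerically.
The sketch below uses plain gradient ascent (not the Newton algorithm
mentioned above) on a quadratically penalized multinomial log-likelihood,
starting from a deliberately asymmetric point; the function name, step size
and iteration count are illustrative assumptions. Because the class
residuals sum to zero for every observation, the quantity
$\sum_k \beta_{kj}$ shrinks by the factor $(1 - \text{step} \cdot \lambda)$
at every iteration.

```python
import math


def fit_ridge_multinomial(X, g, n_classes, lam, step=0.05, n_iter=2000):
    """Gradient ascent on sum_i log Pr(g_i | x_i) - (lam / 2) sum_k ||beta_k||^2.

    X: list of N feature lists; g: list of N class indices in 0..n_classes-1.
    Intercepts are unpenalized; their sum stays at its initial value of zero.
    """
    p = len(X[0])
    beta0 = [0.0] * n_classes
    # Asymmetric start, so the sum-to-zero property is not built in.
    beta = [[0.1 * (k + 1)] * p for k in range(n_classes)]
    for _ in range(n_iter):
        grad0 = [0.0] * n_classes
        grad = [[-lam * b for b in beta_k] for beta_k in beta]  # penalty term
        for x, gi in zip(X, g):
            eta = [beta0[k] + sum(b * xj for b, xj in zip(beta[k], x))
                   for k in range(n_classes)]
            top = max(eta)
            w = [math.exp(e - top) for e in eta]
            total = sum(w)
            for k in range(n_classes):
                resid = (1.0 if k == gi else 0.0) - w[k] / total
                grad0[k] += resid
                for j in range(p):
                    grad[k][j] += resid * x[j]
        for k in range(n_classes):
            beta0[k] += step * grad0[k]
            for j in range(p):
                beta[k][j] += step * grad[k][j]
    return beta0, beta
```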


18.3.3 The Support Vector Classifier 

The support vector classifier is described for the two-class case in
Section 12.2. When $p > N$, it is especially attractive because in general
the classes are perfectly separable by a hyperplane unless there are
identical feature vectors in different classes. Without any regularization
the support vector classifier finds the separating hyperplane with the
largest margin; that is, the hyperplane yielding the biggest gap between
the classes in the training data. Somewhat surprisingly, when $p \gg N$ the
unregularized support vector classifier often works about as well as the
best regularized version. Overfitting often does not seem to be a problem,
partly because of the insensitivity of misclassification loss.

There are many different methods for generalizing the two-class
support-vector classifier to $K > 2$ classes. In the “one versus one” (ovo)
approach, we compute all $\binom{K}{2}$ pairwise classifiers. For each test
point, the predicted class is the one that wins the most pairwise contests.
In the “one versus all” (ova) approach, each class is compared to all of
the others in $K$ two-class comparisons. To classify a test point, we
compute the confidences (signed distance from the hyperplane) for each of
the $K$ classifiers. The winner is the class with the highest confidence.
Finally, Vapnik (1998) and Weston and Watkins (1999) suggested (somewhat
complex) multiclass criteria which generalize the two-class criterion
(12.6).
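
The two voting schemes differ only in how the two-class outputs are
combined. A minimal sketch, in which the two-class classifiers themselves
are assumed to be given (the names `confidences` and `pairwise_winner` are
placeholders, not part of any library):

```python
from collections import Counter
from itertools import combinations


def predict_one_vs_all(confidences):
    """confidences[k]: signed distance from the 'class k versus rest' hyperplane."""
    return max(confidences, key=confidences.get)


def predict_one_vs_one(pairwise_winner, classes):
    """pairwise_winner(a, b) returns whichever of a, b wins that contest.

    Ties in the vote count are broken by the order in which classes first win.
    """
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```

With $K = 4$ classes the one-versus-one scheme needs $\binom{4}{2} = 6$
pairwise classifiers, while one-versus-all needs only four.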

Tibshirani and Hastie (2007) propose the margin tree classifier, in which 
support-vector classifiers are used in a binary tree, much as in CART 
(Chapter 9). The classes are organized in a hierarchical manner, which can 
be useful for classifying patients into different cancer types, for example. 

Line 3 of Table 18.1 shows the results for the support vector classifier
using the OVA method; Ramaswamy et al. (2001) reported (and we confirmed)
that this approach worked best for this problem. The errors are very
similar to those in line 6, as we might expect from the comments at the end
of the previous section. The error rates are insensitive to the choice of
$C$ [the regularization parameter in (12.8) on page 420], for values of
$C > 0.001$. Since $p > N$, the support vector hyperplane can perfectly
separate the training data by setting $C = \infty$.


18.3.4 Feature Selection 

Feature selection is an important scientific requirement for a classifier
when $p$ is large. None of discriminant analysis, logistic regression, or
the support-vector classifier performs feature selection automatically,
because all use quadratic regularization. All features have nonzero weights
in all three models. Ad-hoc methods for feature selection have been
proposed, for example, removing genes with small coefficients, and
refitting the classifier. This is done in a backward stepwise manner,
starting with the smallest weights and moving on to larger weights. This is
known as recursive feature elimination (Guyon et al., 2002). It was not
successful in this example; Ramaswamy et al. (2001) report, for example,
that the accuracy of the support-vector classifier starts to degrade as the
number of genes is reduced from the full set of 16,063. This is rather
remarkable, as the number of training samples is only 144. We do not have
an explanation for this behavior.

All three methods discussed in this section (RDA, LR and SVM) can be
modified to fit nonlinear decision boundaries using kernels. Usually the
motivation for such an approach is to increase the model complexity. With
$p \gg N$ the models are already sufficiently complex and overfitting is
always a danger. Yet despite the high dimensionality, radial kernels
(Section 12.3.3) sometimes deliver superior results in these
high-dimensional problems. The radial kernel tends to dampen inner products
between points far away from each other, which in turn leads to robustness
to outliers. This occurs often in high dimensions, and may explain the
positive results. We tried a radial kernel with the SVM in Table 18.1, but
in this case the performance was inferior.


18.3.5 Computational Shortcuts When $p \gg N$

The computational techniques discussed in this section apply to any method
that fits a linear model with quadratic regularization on the coefficients.
That includes all the methods discussed in this section, and many more.
When $p > N$, the computations can be carried out in an $N$-dimensional
space, rather than $p$, via the singular value decomposition introduced in
Section 14.5. Here is the geometric intuition: just like two points in
three-dimensional space always lie on a line, $N$ points in $p$-dimensional
space lie in an $(N - 1)$-dimensional affine subspace.

Given the $N \times p$ data matrix $X$, let

$$X = U D V^T \tag{18.12}$$
$$\phantom{X} = R V^T \tag{18.13}$$

be the singular-value decomposition (SVD) of $X$; that is, $V$ is
$p \times N$ with orthonormal columns, $U$ is $N \times N$ orthogonal, and
$D$ a diagonal matrix with elements
$d_1 \geq d_2 \geq \cdots \geq d_N \geq 0$. The matrix $R$ is $N \times N$,
with rows $r_i^T$.

As a simple example, let’s first consider the estimates from a ridge
regression:

$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y. \tag{18.14}$$

Replacing $X$ by $R V^T$ and after some further manipulations, this can be
shown to equal

$$\hat{\beta} = V (R^T R + \lambda I)^{-1} R^T y \tag{18.15}$$

(Exercise 18.4). Thus $\hat{\beta} = V \hat{\theta}$, where $\hat{\theta}$
is the ridge-regression estimate using the $N$ observations $(r_i, y_i)$,
$i = 1, 2, \ldots, N$. In other words, we can simply reduce the data matrix
from $X$ to $R$, and work with the rows of $R$. This trick reduces the
computational cost from $O(p^3)$ to $O(pN^2)$ when $p > N$.

These results can be generalized to all models that are linear in the
parameters and have quadratic penalties. Consider any supervised learning
problem where we use a linear function $f(X) = \beta_0 + X^T \beta$ to
model a parameter in the conditional distribution of $Y|X$. We fit the
parameters $\beta$ by minimizing some loss function
$\sum_{i=1}^{N} L(y_i, f(x_i))$ over the data with a quadratic penalty on
$\beta$. Logistic regression is a useful example to have in mind. Then we
have the following simple theorem:

Let $f^*(r_i) = \theta_0 + r_i^T \theta$ with $r_i$ defined in (18.13), and
consider the pair of optimization problems:

$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta \in \mathbb{R}^p} \sum_{i=1}^{N} L(y_i, \beta_0 + x_i^T \beta) + \lambda \beta^T \beta; \tag{18.16}$$

$$(\hat{\theta}_0, \hat{\theta}) = \arg\min_{\theta_0,\, \theta \in \mathbb{R}^N} \sum_{i=1}^{N} L(y_i, \theta_0 + r_i^T \theta) + \lambda \theta^T \theta. \tag{18.17}$$

Then $\hat{\beta}_0 = \hat{\theta}_0$, and $\hat{\beta} = V \hat{\theta}$.

The theorem says that we can simply replace the $p$-vectors $x_i$ by the
$N$-vectors $r_i$, and perform our penalized fit as before, but with far
fewer predictors. The $N$-vector solution $\hat{\theta}$ is then
transformed back to the $p$-vector solution via a simple matrix
multiplication. This result is part of the statistics folklore, and
deserves to be known more widely; see Hastie and Tibshirani (2004) for
further details.
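
For squared-error loss without an intercept, the reduction can be checked
numerically. The sketch below assumes NumPy is available and uses simulated
data with arbitrary sizes; it compares the direct $p \times p$ ridge solve
with the $N \times N$ solve on $R$ followed by multiplication by $V$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 10, 500, 2.0
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# Thin SVD: U is N x N, d holds the N singular values, Vt is N x p.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d  # R = U D, an N x N matrix satisfying X = R V^T

# Ridge regression on the reduced N x N problem, then map back: beta = V theta.
theta = np.linalg.solve(R.T @ R + lam * np.eye(N), R.T @ y)
beta_reduced = Vt.T @ theta

# Direct p x p solve, feasible here only because p is modest.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_reduced, beta_direct))
```

The reduced solve works with a $10 \times 10$ system instead of a
$500 \times 500$ one; the only $p$-dimensional work is the SVD and the final
multiplication by $V$.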

Geometrically, we are rotating the features to a coordinate system in 
which all but the first N coordinates are zero. Such rotations are allowed 
since the quadratic penalty is invariant under rotations, and linear models 
are equivariant. 

This result can be applied to many of the learning methods discussed 
in this chapter, such as regularized (multiclass) logistic regression, linear 
discriminant analysis (Exercise 18.6), and support vector machines. It also 
applies to neural networks with quadratic regularization (Section 11.5.2). 
Note, however, that it does not apply to methods such as the lasso, which
use nonquadratic ($L_1$) penalties on the coefficients.

Typically we use cross-validation to select the parameter $\lambda$. It can
be seen (Exercise 18.12) that we only need to construct $R$ once, on the
original data, and use it as the data for each of the CV folds.

The support vector “kernel trick” of Section 12.3.7 exploits the same
reduction used in this section, in a slightly different context. Suppose we
have at our disposal the $N \times N$ gram (inner-product) matrix
$K = X X^T$. From (18.12) we have $K = U D^2 U^T$, and so $K$ captures the
same information as $R$. Exercise 18.13 shows how we can exploit the ideas
in this section to fit a ridged logistic regression with $K$ using its SVD.
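
To illustrate that $K$ alone suffices, the sketch below recovers a matrix
$R = U D$ from the eigendecomposition of $K$ and uses it for ridge
regression (squared-error loss rather than the logistic loss of the
exercise, chosen here only because it has a closed form). NumPy and the
simulated data are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 8, 200, 1.5
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

K = X @ X.T                                  # N x N gram matrix
evals, U = np.linalg.eigh(K)                 # K = U diag(evals) U^T
R = U * np.sqrt(np.clip(evals, 0.0, None))   # R = U D, so that R R^T = K

# Ridge fit using only R, which was built from K without touching X.
theta = np.linalg.solve(R.T @ R + lam * np.eye(N), R.T @ y)
fitted_from_K = R @ theta

# Reference: ridge fit on the full p-dimensional features.
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fitted_from_X = X @ beta
```

Both routes give the fitted values $K (K + \lambda I)^{-1} y$, so they
agree up to rounding error.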


18.4 Linear Classifiers with $L_1$ Regularization


The methods of Section 18.3 use an $L_2$ penalty to regularize their
parameters, just as in ridge regression. All of the estimated coefficients
are nonzero, and hence no feature selection is performed. In this section
we discuss methods that use $L_1$ penalties instead, and hence provide
automatic feature selection.

Recall the lasso of Section 3.4.2,

$$\min_{\beta} \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \tag{18.18}$$

which we have written in the Lagrange form (3.52). As discussed there, the
use of the $L_1$ penalty causes a subset of the solution coefficients
$\hat{\beta}_j$ to be exactly zero, for a sufficiently large value of the
tuning parameter $\lambda$.
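
The exact zeros come from the soft-thresholding step that appears when
(18.18) is minimized one coordinate at a time. The following is a small
pure-Python sketch of cyclical coordinate descent for this criterion; it is
meant only to show the mechanism, and the function names and the fixed
number of sweeps are illustrative choices rather than a tuned
implementation.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * sum_i (y_i - b0 - sum_j x_ij b_j)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered columns
    resid = [yi - y_mean for yi in y]                           # residual at beta = 0
    beta = [0.0] * p
    col_ss = [sum(row[j] ** 2 for row in Xc) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes beta_j.
            rho = sum(row[j] * r for row, r in zip(Xc, resid)) + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != beta[j]:
                shift = new - beta[j]
                resid = [r - row[j] * shift for row, r in zip(Xc, resid)]
                beta[j] = new
    b0 = y_mean - sum(m * b for m, b in zip(x_mean, beta))
    return b0, beta
```

Once $\lambda$ is at least as large as the largest absolute inner product
between a centered feature and the centered response, every coefficient is
thresholded to zero.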

In Section 3.8.1 we discussed the LARS algorithm, an efficient procedure
for computing the lasso solution for all $\lambda$. When $p > N$ (as in
this chapter), as $\lambda$ approaches zero, the lasso fits the training
data exactly. In fact, by convex duality one can show that when $p > N$ the
number of non-zero coefficients is at most $N$ for all values of $\lambda$
(Rosset and Zhu, 2007, for example). Thus the lasso provides a (severe)
form of feature selection.

Lasso regression can be applied to a two-class classification problem by
coding the outcome ±1, and applying a cutoff (usually 0) to the predictions.
For more than two classes, there are many possible approaches, including
the OVA and OVO methods discussed in Section 18.3.3. We tried the OVA
approach on the cancer data in Section 18.3. The results are shown in
line (4) of Table 18.1. Its performance is among the best.

A more natural approach for classification problems is to use the lasso 
penalty to regularize logistic regression. Several implementations have been 
proposed in the literature, including path algorithms similar to LARS (Park 
and Hastie, 2007). Because the paths are piecewise smooth but nonlinear, 
exact methods are slower than the LARS algorithm, and are less feasible 
when p is large. 

Friedman et al. (2010) provide very fast algorithms for fitting
$L_1$-penalized logistic and multinomial regression models. They use the
symmetric multinomial logistic regression model as in (18.10) in
Section 18.3.2, and maximize the penalized log-likelihood

$$\max_{\{\beta_{k0},\, \beta_k \in \mathbb{R}^p\}_1^K} \left[ \sum_{i=1}^{N} \log \Pr(g_i \mid x_i) - \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{kj}| \right]; \tag{18.19}$$

compare with (18.11). Their algorithm computes the exact solution at a
pre-chosen sequence of values for $\lambda$ by cyclical coordinate descent
(Section 3.8.6), and exploits the fact that solutions are sparse when
$p \gg N$, as well as the fact that solutions for neighboring values of
$\lambda$ tend to be very similar. This method was used in line (7) of
Table 18.1, with the overall tuning parameter $\lambda$ chosen by
cross-validation. The performance was similar to that of the best methods,
except here the automatic feature selection chose 269 genes altogether. A
similar approach is used in Genkin et al. (2007); although they present
their model from a Bayesian point of view, they in fact compute the
posterior mode, which solves the penalized maximum-likelihood problem.


[Figure: two panels titled Lasso and Elastic Net; horizontal axes
$\log(\lambda)$.]

FIGURE 18.5. Regularized logistic regression paths for the leukemia data.
The left panel is the lasso path, the right panel the elastic-net path with
$\alpha = 0.8$. At the ends of the path (extreme left), there are 19
nonzero coefficients for the lasso, and 39 for the elastic net. The
averaging effect of the elastic net results in more non-zero coefficients
than the lasso, but with smaller magnitudes.


In genomic applications, there are often strong correlations among the
variables; genes tend to operate in molecular pathways. The lasso penalty
is somewhat indifferent to the choice among a set of strong but correlated
variables (Exercise 3.28). The ridge penalty, on the other hand, tends to
shrink the coefficients of correlated variables toward each other
(Exercise 3.29 on page 99). The elastic net penalty (Zou and Hastie, 2005)
is a compromise, and has the form

$$\sum_{j=1}^{p} \big( \alpha |\beta_j| + (1 - \alpha) \beta_j^2 \big). \tag{18.20}$$

The second term encourages highly correlated features to be averaged, while
the first term encourages a sparse solution in the coefficients of these
averaged features. The elastic net penalty can be used with any linear
model, in particular for regression or classification.
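
A two-feature example makes the averaging effect concrete. If two features
are identical, the coefficient vectors $(1, 0)$ and $(0.5, 0.5)$ give the
same fit; the sketch below (with a hypothetical function name) compares
their penalties under (18.20).

```python
def elastic_net_penalty(beta, alpha):
    """sum_j (alpha * |b_j| + (1 - alpha) * b_j^2), the penalty in (18.20)."""
    return sum(alpha * abs(b) + (1.0 - alpha) * b * b for b in beta)


concentrated = [1.0, 0.0]   # all weight on one of two identical features
averaged = [0.5, 0.5]       # weight shared equally between them

# Pure lasso (alpha = 1): both coefficient vectors cost the same.
lasso_costs = (elastic_net_penalty(concentrated, 1.0),
               elastic_net_penalty(averaged, 1.0))

# With alpha < 1 the quadratic term makes the averaged vector cheaper.
enet_costs = (elastic_net_penalty(concentrated, 0.8),
              elastic_net_penalty(averaged, 0.8))
```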

Hence the multinomial problem above with elastic-net penalty becomes

$$\max_{\{\beta_{k0},\, \beta_k \in \mathbb{R}^p\}_1^K} \left[ \sum_{i=1}^{N} \log \Pr(g_i \mid x_i) - \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \big( \alpha |\beta_{kj}| + (1 - \alpha) \beta_{kj}^2 \big) \right]. \tag{18.21}$$

The parameter $\alpha$ determines the mix of the penalties, and is often
pre-chosen on qualitative grounds. The elastic net can yield more than $N$
non-zero coefficients when $p > N$, a potential advantage over the lasso.
Line (8) in Table 18.1 uses this model, with $\alpha$ and $\lambda$ chosen
by cross-validation. We used a sequence of 20 values of $\alpha$ between
0.05 and 1.0, and 100 values of $\lambda$ uniform on the log scale covering
the entire range. Values of $\alpha \in [0.75, 0.80]$ gave the minimum CV
error, with values of $\lambda < 0.001$ for all tied solutions. Although it
has the lowest test error among all methods, the margin is small and not
significant. Interestingly, when CV is performed separately for each value
of $\alpha$, a minimum test error of 8.8 is achieved at $\alpha = 0.10$, but
this is not the value chosen in the two-dimensional CV.


FIGURE 18.6. Training, test, and 10-fold cross validation curves for lasso
logistic regression on the leukemia data. The left panel shows
misclassification errors, the right panel shows deviance.

Figure 18.5 shows the lasso and elastic-net coefficient paths on the
two-class leukemia data (Golub et al., 1999). There are 7129
gene-expression measurements on 38 samples, 27 of them in class ALL (acute
lymphocytic leukemia), and 11 in class AML (acute myelogenous leukemia).
There is also a test set with 34 samples (20, 14). Since the data are
linearly separable, the solution is undefined at $\lambda = 0$
(Exercise 18.11), and degrades for very small values of $\lambda$. Hence
the paths have been truncated as the fitted probabilities approach 0 and 1.
There are 19 non-zero coefficients in the left plot, and 39 in the right.
Figure 18.6 (left panel) shows the misclassification errors for the lasso
logistic regression on the training and test data, as well as for 10-fold
cross-validation on the training data. The right panel uses binomial
deviance to measure errors, and is much smoother. The small sample sizes
lead to considerable sampling variance in these curves, even though
individual curves are relatively smooth (see, for example, Figure 7.1 on
page 220). Both of these plots suggest that the limiting solution
$\lambda \downarrow 0$ is adequate, leading to 3/34 misclassifications in
the test set. The corresponding figures for the elastic net are
qualitatively similar and are not shown.

For $p \gg N$, the limiting coefficients diverge for all regularized
logistic regression models, so in practical software implementations a
minimum value for $\lambda > 0$ is either explicitly or implicitly set.
However, renormalized versions of the coefficients converge, and these
limiting solutions can be thought of as interesting alternatives to the
linear optimal separating hyperplane (SVM). With $\alpha = 0$ the limiting
solution coincides with the SVM (see end of Section 18.3.2), but all the
7129 genes are selected. With $\alpha = 1$, the limiting solution coincides
with an $L_1$ separating hyperplane (Rosset et al., 2004a), and includes at
most 38 genes. As $\alpha$ decreases from 1, the elastic-net solutions
include more genes in the separating hyperplane.

18.4.1 Application of Lasso to Protein Mass Spectroscopy

Protein mass spectrometry has become a popular technology for analyzing 
the proteins in blood, and can be used to diagnose a disease or understand 
the processes underlying it. 

For each blood serum sample $i$, we observe the intensity $x_{ij}$ for many
time of flight values $t_j$. This intensity is related to the number of
particles observed to take approximately $t_j$ time to pass from the
emitter to the detector during a cycle of operation of the machine. The
time of flight has a known relationship to the mass over charge ratio
($m/z$) of the constituent proteins in the blood. Hence the identification
of a peak in the spectrum at a certain $t_j$ tells us that there is a
protein with a corresponding mass and charge. The identity of this protein
can then be determined by other means.

Figure 18.7 shows an example taken from Adam et al. (2003). It shows 
the average spectra for healthy patients and those with prostate cancer. 
There are 16,898 m/z sites in total, ranging in value from 2000 to 40,000. 
The full dataset consists of 157 healthy patients and 167 with cancer, and 
the goal is to find m/z sites that discriminate between the two groups. 
This is an example of functional data; the predictors can be viewed as a 
function of m/z. There has been much interest in this problem in the past 
few years; see e.g. Petricoin et al. (2002). 

The data were first standardized (baseline subtraction and normalization),
and we restricted attention to m/z values between 2000 and 40,000 (spectra
outside of this range were not of interest). We then applied nearest
shrunken centroids and lasso regression to the data, with the results for
both methods shown in Table 18.2.

FIGURE 18.7. Protein mass spectrometry data: average profiles from normal
and prostate cancer patients.

By fitting harder to the data, the lasso achieves a considerably lower
test error rate. However, it may not provide a scientifically useful
solution. Ideally, protein mass spectrometry resolves a biological sample
into its constituent proteins, and these should appear as peaks in the
spectra. The lasso doesn’t treat peaks in any special way, so not
surprisingly only some of the non-zero lasso weights were situated near
peaks in the spectra. Furthermore, the same protein may yield a peak at
slightly different m/z values in different spectra. In order to identify
common peaks, some kind of m/z warping is needed from sample to sample.

To address this, we applied a standard peak-extraction algorithm to each
spectrum, yielding a total of 5178 peaks in the 217 training spectra. Our
idea was to pool the collection of peaks from all patients, and hence
construct a set of common peaks. For this purpose, we applied hierarchical
clustering to the positions of these peaks along the log m/z axis. We cut
the resulting dendrogram horizontally at height log(0.005) (see
footnote 3), and computed averages of the peak positions in each resulting
cluster. This process yielded 728 common clusters and their corresponding
peak centers.

Given these 728 common peaks, we determined which of these were present in
each individual spectrum, and if present, the height of the peak. A peak
height of zero was assigned if that peak was not found. This produced a
$217 \times 728$ matrix of peak heights as features, which was used in a
lasso regression. We scored the test spectra for the same 728 peaks.
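
The pooling and scoring steps can be sketched as follows. Several details
here are assumptions, since the text does not give them: single linkage is
used (in one dimension, cutting a single-linkage dendrogram at a given
height is the same as splitting the sorted positions wherever the gap
exceeds that height), the tolerance 0.005 is applied directly to
differences in log m/z, and a spectrum's peak is matched to the nearest
common peak within that tolerance. All names are hypothetical.

```python
import math


def common_peaks(positions, tol=0.005):
    """Group pooled peak positions (m/z values) into common peaks.

    Returns the mean log-position of each cluster, in increasing order.
    """
    logs = sorted(math.log(mz) for mz in positions)
    clusters, current = [], [logs[0]]
    for value in logs[1:]:
        if value - current[-1] > tol:   # gap too large: start a new cluster
            clusters.append(current)
            current = [value]
        else:
            current.append(value)
    clusters.append(current)
    return [sum(c) / len(c) for c in clusters]


def peak_features(spectrum_peaks, centers, tol=0.005):
    """Height of each common peak in one spectrum, or 0.0 if it is absent.

    spectrum_peaks: list of (mz, height) pairs extracted from this spectrum.
    """
    row = [0.0] * len(centers)
    for mz, height in spectrum_peaks:
        pos = math.log(mz)
        nearest = min(range(len(centers)), key=lambda c: abs(centers[c] - pos))
        if abs(centers[nearest] - pos) <= tol:
            row[nearest] = max(row[nearest], height)
    return row
```

Stacking one such row per spectrum gives the matrix of peak heights that is
passed to the lasso.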


Footnote 3: Use of the value 0.005 means that peaks with positions less
than 0.5% apart are considered the same peak, a fairly common assumption.

TABLE 18.2. Results for the prostate data example. The standard deviation
for the test errors is about 4.5.

Method                           Test Errors/108   Number of Sites
1. Nearest shrunken centroids    34                459
2. Lasso                         22                113
3. Lasso on peaks                28                35


The prediction results for this application of the lasso to the peaks are 
shown in the last line of Table 18.2: it does fairly well, but not as well 
as the lasso on the raw spectra. However, the fitted model may be more 
useful to the biologist as it yields 35 peak positions for further study. On 
the other hand, the results suggest that there may be useful discriminatory 
information between the peaks of the spectra, and the positions of the lasso 
sites from line (2) of the table also deserve further examination. 


18.4.2 The Fused Lasso for Functional Data

In the previous example, the features had a natural order, determined by
the mass-to-charge ratio m/z. More generally, we may have functional
features $x_i(t)$ that are ordered according to some index variable $t$. We
have already discussed several approaches for exploiting such structure.

We can represent $x_i(t)$ by their coefficients in a basis of functions in
$t$, such as splines, wavelets or Fourier bases, and then apply a
regression using these coefficients as predictors. Equivalently, one can
instead represent the coefficients of the original features in these bases.
These approaches are described in Section 5.3.

In the classification setting, we discuss the analogous approach of
penalized discriminant analysis in Section 12.6. This uses a penalty that
explicitly controls the resulting smoothness of the coefficient vector.

The above methods tend to smooth the coefficients uniformly. Here we
present a more adaptive strategy that modifies the lasso penalty to take
into account the ordering of the features. The fused lasso (Tibshirani et
al., 2005) solves

$$\min_{\beta \in \mathbb{R}^p} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| \right\}. \tag{18.22}$$

This criterion is strictly convex in $\beta$, so a unique solution exists.
The first penalty encourages the solution to be sparse, while the second
encourages it to be smooth in the index $j$.

The difference penalty in (18.22) assumes a uniformly spaced index $j$. If
instead the underlying index variable $t$ has nonuniform values $t_j$, a
natural generalization of (18.22) would be based on divided differences

$$\lambda_2 \sum_{j=1}^{p-1} \frac{|\beta_{j+1} - \beta_j|}{|t_{j+1} - t_j|}. \tag{18.23}$$

This amounts to having a penalty modifier for each of the terms in the
series.

FIGURE 18.8. Fused lasso applied to CGH data. Each point represents the
copy-number of a gene in a tumor sample, relative to that of a control (on
the log base-2 scale).
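
To make the two penalties concrete, the following sketch evaluates the
fused lasso criterion for a given coefficient vector, optionally using the
divided differences of (18.23). It only evaluates the objective and does
not solve the optimization problem; the function and argument names are
illustrative.

```python
def fused_lasso_objective(X, y, beta0, beta, lam1, lam2, t=None):
    """Residual sum of squares plus the sparsity and fusion penalties.

    If index values t are supplied, each difference is divided by
    |t[j+1] - t[j]|; otherwise the index is treated as uniformly spaced.
    """
    rss = sum((yi - beta0 - sum(x * b for x, b in zip(row, beta))) ** 2
              for row, yi in zip(X, y))
    sparsity = sum(abs(b) for b in beta)
    fusion = 0.0
    for j in range(len(beta) - 1):
        gap = 1.0 if t is None else abs(t[j + 1] - t[j])
        fusion += abs(beta[j + 1] - beta[j]) / gap
    return rss + lam1 * sparsity + lam2 * fusion
```

A piecewise-constant coefficient vector pays the fusion penalty only at its
jumps, and a jump between widely separated index values is penalized less
under (18.23).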

A particularly useful special case arises when the predictor matrix
$X = I_N$, the $N \times N$ identity matrix. This is a special case of the
fused lasso, used to approximate a sequence $\{y_i\}_1^N$. The fused lasso
signal approximator solves

$$\min_{\beta \in \mathbb{R}^N} \left\{ \sum_{i=1}^{N} (y_i - \beta_0 - \beta_i)^2 + \lambda_1 \sum_{i=1}^{N} |\beta_i| + \lambda_2 \sum_{i=1}^{N-1} |\beta_{i+1} - \beta_i| \right\}. \tag{18.24}$$

Figure 18.8 shows an example taken from Tibshirani and Wang (2007). The 
data in the panel come from a Comparative Genomic Hybridization (CGH) 
array, measuring the approximate log (base-two) ratio of the number of 
copies of each gene in a tumor sample, as compared to a normal sample. 
The horizontal axis represents the chromosomal location of each gene. The 
idea is that in cancer cells, genes are often amplified (duplicated) or deleted, 
and it is of interest to detect these events. Furthermore, these events tend 
to occur in contiguous regions. The smoothed signal estimate from the 
fused lasso signal approximator is shown in dark red (with appropriately 
chosen values for $\lambda_1$ and $\lambda_2$). The significantly nonzero regions can be used 
to detect locations of gains and losses of genes in the tumor. 

There is also a two-dimensional version of the fused lasso, in which the 
parameters are laid out in a grid of pixels, and a penalty is applied to the 
first differences to the left, right, above and below the target pixel. This 
can be useful for denoising or classifying images. Friedman et al. (2007) 
develop fast generalized coordinate descent algorithms for the one- and 
two-dimensional fused lasso. 


18.5 Classification When Features are Unavailable 

In some applications the objects under study are more abstract in nature, 
and it is not obvious how to define a feature vector. As long as we can fill 
in an N × N proximity matrix of similarities between pairs of objects in our 
database, it turns out we can put to use many of the classifiers in our arsenal 
by interpreting the proximities as inner-products. Protein structures fall 
into this category, and we explore an example in Section 18.5.1 below. 

In other applications, such as document classification, feature vectors are 
available but can be extremely high-dimensional. Here we may not wish 
to compute with such high-dimensional data, but rather store the inner- 
products between pairs of documents. Often these inner-products can be 
approximated by sampling techniques. 

Pairwise distances serve a similar purpose, because they can be turned 
into centered inner-products. Proximity matrices are discussed in more detail 
in Chapter 14. 


18.5.1 Example: String Kernels and Protein Classification 

An important problem in computational biology is to classify proteins into 
functional and structural classes based on their sequence similarities. Protein 
molecules are strings of amino acids, differing in both length and composition. 
In the example we consider, the lengths vary between 75 and 160 
amino-acid molecules, each of which can be one of 20 different types, labeled 
using letters. Here are two examples, of length 110 and 153, respectively: 

IPTSALVKETLALLSTHRTLLIANETLRIPVPVHKNHQLCTEEIFQGIGTLESQTVQGGTV 

ERLFKNLSLIKKYIDGQKKKCGEERRRVNQFLDYLQEFLGVMNTEWI 

PHRRDLCSRSIWLARKIRSDLTALTESYVKHQGLWSELTEAERLQENLQAYRTFHVLLA 

RLLEDQQVHFTPTEGDFHQAIHTLLLQVAAFAYQIEELMILLEYKIPRNEADGMLFEKK 

LWGLKVLQELSQWTVRSIHDLRFISSHQTGIP 

There have been many proposals for measuring the similarity between a 
pair of protein molecules. Here we focus on a measure based on the count 
of matching substrings (Leslie et al., 2004), such as the LQE above. 

To construct our features, we count the number of times that a given 
sequence of length m occurs in our string, and we compute this number 
for all possible sequences of length m. Formally, for a string x, we define a 
feature map 

$$
\Phi_m(x) = \{\phi_a(x)\}_{a \in \mathcal{A}_m}, \tag{18.25}
$$

where $\mathcal{A}_m$ is the set of subsequences of length m, and $\phi_a(x)$ is the number 
of times that “a” occurs in our string x. Using this, we define the inner 
product 

$$
K_m(x_1, x_2) = \langle \Phi_m(x_1), \Phi_m(x_2) \rangle, \tag{18.26}
$$

which measures the similarity between the two strings $x_1$, $x_2$. This can be 
used to drive, for example, a support vector classifier for classifying strings 
into different protein classes. 
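
The feature map (18.25) and kernel (18.26) can be sketched for short strings as follows. This is an illustrative hash-based version that stores only the substrings that actually occur; it is not the tree-based algorithm of Leslie et al. (2004), and the function names are assumptions.

```python
from collections import Counter


def spectrum_counts(x, m):
    """Sparse form of Phi_m(x): counts of every length-m substring of x."""
    return Counter(x[i:i + m] for i in range(len(x) - m + 1))


def string_kernel(x1, x2, m):
    """K_m(x1, x2) = <Phi_m(x1), Phi_m(x2)>.

    Only substrings present in both strings contribute, so the
    20^m-dimensional vectors are never formed explicitly.
    """
    c1, c2 = spectrum_counts(x1, m), spectrum_counts(x2, m)
    if len(c1) > len(c2):
        c1, c2 = c2, c1  # iterate over the smaller dictionary
    return sum(count * c2[a] for a, count in c1.items())
```

As a small check, the strings "ABAB" and "BABA" with m = 2 have counts (AB: 2, BA: 1) and (AB: 1, BA: 2), so the kernel value is 2·1 + 1·2 = 4. Calling `spectrum_counts(x, 3)["LQE"]` on a protein string returns the count $\phi_{LQE}(x)$ of the substring LQE.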

Now the number of possible sequences a is $|\mathcal{A}_m| = 20^m$, which can be 
very large for moderate m, and the vast majority of the subsequences do 
not match the strings in our training set. It turns out that we can compute 
the N × N inner-product matrix or string kernel $K_m$ (18.26) efficiently 
using tree-structures, without actually computing the individual vectors. 
This methodology, and the data to follow, come from Leslie et al. (2004).⁴ 

The data consist of 1708 proteins in two classes: negative (1663) and 
positive (45). The two examples above, which we will call $x_1$ and $x_2$, 
are from this set. We have marked the occurrences of subsequence LQE, 
which appears in both proteins. There are $20^3$ possible subsequences, so 
$\Phi_3(x)$ will be a vector of length 8000. For this example $\phi_{LQE}(x_1) = 1$ and 
$\phi_{LQE}(x_2) = 2$. 

Using software from Leslie et al. (2004), we computed the string kernel 
for m = 4, which was then used in a support vector classifier to find the 
maximal margin solution in this $20^4 = 160{,}000$-dimensional feature space. 
We used 10-fold cross-validation to compute the SVM predictions on all of 
the training data. The orange curve in Figure 18.9 shows the cross-validated 
ROC curve for the support vector classifier, computed by varying the cut- 
point on the real-valued predictions from the cross-validated support vector 
classifier. The area under the curve is 0.84. Leslie et al. (2004) show that 
the string kernel method is competitive with, but perhaps not as accurate 
as, more specialized methods for protein string matching. 

Many other classifiers can be computed using only the information in the 
kernel matrix; some details are given in the next section. The results for 
the nearest centroid classifier (green), and distance-weighted one-nearest 
neighbors (blue) are shown in Figure 18.9. Their performance is similar to 
that of the support vector classifier. 


⁴ We thank Christina Leslie for her help and for providing the data, which is available 
on our book website. 

FIGURE 18.9. Cross-validated ROC curves for protein example using the string 
kernel. The numbers next to each method in the legend give the area under the 
curve, an overall measure of accuracy. The SVM achieves better sensitivities than 
the other two, which achieve better specificities. 

18.5.2 Classification and Other Models Using Inner-Product 
Kernels and Pairwise Distances 

There are a number of other classifiers, besides the support-vector machine, 
that can be implemented using only inner-product matrices. This 
also implies they can be “kernelized” like the SVM. 

An obvious example is nearest-neighbor classification, since we can transform 
pairwise inner-products to pairwise distances: 

$$
\|x_i - x_{i'}\|^2 = \langle x_i, x_i \rangle + \langle x_{i'}, x_{i'} \rangle - 2\langle x_i, x_{i'} \rangle. \tag{18.27}
$$


A variation of 1-NN classification is used in Figure 18.9, which produces 
a continuous discriminant score needed to construct a ROC curve. This 
distance-weighted 1-NN makes use of the distance of a test point to the 
closest member of each class; see Exercise 18.14. 

Nearest-centroid classification follows easily as well. For training pairs 
$(x_i, g_i)$, i = 1,..., N, a test point $x_0$, and class centroids $\bar{x}_k$, k = 1,..., K, 
we can write 

$$
\|x_0 - \bar{x}_k\|^2 = \langle x_0, x_0 \rangle - \frac{2}{N_k} \sum_{g_i = k} \langle x_0, x_i \rangle + \frac{1}{N_k^2} \sum_{g_i = k} \sum_{g_{i'} = k} \langle x_i, x_{i'} \rangle, \tag{18.28}
$$

where $N_k$ is the number of training observations in class k. 
Hence we can compute the distance of the test point to each of the centroids, 
and perform nearest centroid classification. This also implies that 
methods like K-means clustering can be implemented using only the 
inner products of the data points. 
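
Nearest-centroid classification from inner products alone can be sketched as below. The function names and the list-of-lists layout of the kernel matrix are illustrative assumptions; the class centroid is taken to be the plain average of the training points in that class.

```python
def centroid_distances(K, k0, k00, labels):
    """Squared distance from a test point to each class centroid.

    K      : N x N list of lists of training inner products <x_i, x_i'>
    k0     : length-N list of inner products <x_0, x_i> with the test point
    k00    : the scalar <x_0, x_0>
    labels : length-N list of class labels g_i
    """
    dists = {}
    for k in set(labels):
        idx = [i for i, g in enumerate(labels) if g == k]
        n_k = len(idx)
        # Average inner product between the test point and class k.
        cross = sum(k0[i] for i in idx) / n_k
        # Average inner product among all pairs in class k.
        within = sum(K[i][j] for i in idx for j in idx) / n_k ** 2
        dists[k] = k00 - 2.0 * cross + within
    return dists


def nearest_centroid(K, k0, k00, labels):
    """Label of the class whose centroid is closest to the test point."""
    d = centroid_distances(K, k0, k00, labels)
    return min(d, key=d.get)
```

The feature vectors themselves never appear: everything is computed from K, k0 and k00, so any valid kernel (for example the string kernel above) can be substituted for the linear inner product.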

Logistic and multinomial regression with quadratic regularization can 
also be implemented with inner-product kernels; see Section 12.3.3 and 
Exercise 18.13. Exercise 12.10 derives linear discriminant analysis using an 
inner-product kernel. 

Principal components can be computed using inner-product kernels as 
well; since this is frequently useful, we give some details. Suppose first 
that we have a centered data matrix X, and let $X = UDV^T$ be its SVD 
(18.12). Then Z = UD is the matrix of principal component variables (see 
Section 14.5.1). But if $K = XX^T$, then it follows that $K = UD^2U^T$, and 
hence we can compute Z from the eigen decomposition of K. If X is not 
centered, then we can center it using $\tilde{X} = (I - M)X$, where $M = \frac{1}{N}\mathbf{1}\mathbf{1}^T$ 
is the mean operator. Thus we compute the eigenvectors of the double-centered 
kernel $(I - M)K(I - M)$ for the principal components from an 
uncentered inner-product matrix. Exercise 18.15 explores this further, and 
Section 14.5.4 discusses in more detail kernel PCA for general kernels, such 
as the radial kernel used in SVMs. 

If instead we had available only the pairwise (squared) Euclidean distances 
between observations, 

$$
\Delta^2_{ii'} = \|x_i - x_{i'}\|^2, \tag{18.29}
$$


it turns out we can do all of the above as well. The trick is to convert the 
pairwise distances to centered inner-products, and then proceed as before. 
We write 


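$$
\Delta^2_{ii'} = \|x_i - \bar{x}\|^2 + \|x_{i'} - \bar{x}\|^2 - 2\langle x_i - \bar{x}, x_{i'} - \bar{x} \rangle. \tag{18.30}
$$

Defining $B = \{-\Delta^2_{ii'}/2\}$, we double center B: 

$$
\tilde{K} = (I - M)B(I - M); \tag{18.31}
$$

it is easy to check that $\tilde{K}_{ii'} = \langle x_i - \bar{x}, x_{i'} - \bar{x} \rangle$, the centered inner-product 
matrix. 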
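
The conversion from squared distances to centered inner products can be sketched in a few lines. The helper names are illustrative assumptions; double centering is written out elementwise, using the fact that (I − M)B(I − M) subtracts the row and column means of B and adds back the grand mean.

```python
def double_center(B):
    """Return (I - M) B (I - M), where M = (1/N) 1 1^T is the mean operator."""
    n = len(B)
    row_mean = [sum(row) / n for row in B]
    col_mean = [sum(B[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [B[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def centered_kernel_from_sq_distances(D2):
    """Centered inner-product matrix from squared Euclidean distances.

    D2 is an N x N list of lists with D2[i][j] = ||x_i - x_j||^2.
    """
    B = [[-0.5 * d for d in row] for row in D2]
    return double_center(B)
```

For the one-dimensional points 0, 1, 5 (mean 2), the centered values are −2, −1, 3, and the function recovers their outer product, for example 4 in the first diagonal entry and −6 between the first and third points. The same `double_center` applied to an uncentered inner-product matrix K gives the double-centered kernel used for principal components.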

Distances and inner-products also allow us to compute the medoid in each 
class—the observation with smallest average distance to other observations 
in that class. This can be used for classification (closest medoids), as well as 
to drive K-medoids clustering (Section 14.3.10). With abstract data objects 
like proteins, medoids have a practical advantage over means. The medoid is 
one of the training examples, and can be displayed. We tried closest medoids 
in the example in the next section (see Table 18.3), and its performance is 
disappointing. 

It is useful to consider what we cannot do with inner-product kernels and 
distances: 

• We cannot standardize the variables; standardization significantly improves 
performance in the example in the next section. 

• We cannot assess directly the contributions of individual variables. 
In particular, we cannot perform individual t-tests, fit the nearest 
shrunken centroids model, or fit any model that uses the lasso penalty. 

• We cannot separate the good variables from the noise: all variables get 
an equal say. If, as is often the case, the ratio of relevant to irrelevant 
variables is small, methods that use kernels are not likely to work as 
well as methods that do feature selection. 

TABLE 18.3. Cross-validated error rates for the abstracts example. The nearest 
shrunken centroids ended up using no shrinkage, but does use a word-by-word 
standardization (Section 18.2). This standardization gives it a distinct advantage 
over the other methods. 

       Method                         CV Error (SE)
  1.   Nearest shrunken centroids     0.17 (0.05)
  2.   SVM                            0.23 (0.06)
  3.   Nearest medoids                0.65 (0.07)
  4.   1-NN                           0.44 (0.07)
  5.   Nearest centroids              0.29 (0.07)


18.5.3 Example: Abstracts Classification 

This somewhat whimsical example serves to illustrate a limitation of kernel 
approaches. We collected the abstracts from 48 papers, 16 each from 
Bradley Efron (BE), Trevor Hastie and Rob Tibshirani (HT) (frequent coauthors), 
and Jerome Friedman (JF). We extracted all unique words from 
these abstracts, and defined features $x_{ij}$ to be the number of times word 
j appears in abstract i. This is the so-called bag of words representation. 
Quotations, parentheses and special characters were first removed from the 
abstracts, and all characters were converted to lower case. We also removed 
the word “we”, which could unfairly discriminate HT abstracts from the 
others. 
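
The preprocessing just described can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the tokenization rule (lower-case, then keep runs of the letters a to z) is an assumption, and it also drops digits and splits hyphenated words.

```python
import re
from collections import Counter


def bag_of_words(abstracts, stop_words=("we",)):
    """Build the count matrix x_ij = number of times word j appears in abstract i.

    Returns the sorted vocabulary and a list of rows, one per abstract.
    """
    docs = []
    for text in abstracts:
        # Lower-case, then keep only runs of letters; quotation marks,
        # parentheses and other special characters act as separators.
        words = re.findall(r"[a-z]+", text.lower())
        docs.append(Counter(w for w in words if w not in stop_words))
    vocab = sorted(set().union(*docs))
    X = [[doc[w] for w in vocab] for doc in docs]
    return vocab, X
```

Each row of the returned matrix is one abstract and each column one unique word, so the number of columns plays the role of p in what follows.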

There were 4492 total words, of which p = 1310 were unique. We sought 
to classify the documents into BE, HT or JF on the basis of the features 
$x_{ij}$. Although it is artificial, this example allows us to assess the possible 
degradation in performance if information specific to the raw features is 
not used. 

We first applied the nearest shrunken centroid classifier to the data, using 
10-fold cross-validation. It essentially chose no shrinkage, and so used all the 
features; see the first line of Table 18.3. The error rate is 17%; the number 
of features can be reduced to about 500 without much loss in accuracy. 
Note that the nearest shrunken centroid classifier requires the raw feature matrix 
X in order to standardize the features individually. Figure 18.10 shows the 
top 20 discriminating words, with a positive score indicating that a word 
appears more in that class than in the other classes. 

FIGURE 18.10. Abstracts example: top 20 scores from nearest shrunken centroids. 
Each score is the standardized difference in frequency for the word in the 
given class (BE, HT or JF) versus all classes. Thus a positive score (to the right 
of the vertical grey zero lines) indicates a higher frequency in that class; a negative 
score indicates a lower relative frequency. 

Some of these terms make sense: for example “frequentist” and “Bayesian” 
reflect Efron’s greater emphasis on statistical inference. However, many others 
are surprising, and reflect personal writing styles: for example, Friedman’s 
use of “presented” and HT’s use of “propose”. 

We then applied the support vector classifier with linear kernel and no 
regularization, using the “all pairs” (ovo) method to handle the three 
classes (regularization of the SVM did not improve its performance). The 
result is shown in Table 18.3. It does somewhat worse than the nearest 
shrunken centroid classifier. 

As mentioned, the first line of Table 18.3 represents nearest shrunken centroids 
(with no shrinkage). Denote by $s_j$ the pooled within-class standard 
deviation for feature j, and $s_0$ the median of the $s_j$ values. Then line (1) 
also corresponds to nearest centroid classification, after first standardizing 
each feature by $s_j + s_0$ [recall (18.4) on page 652]. 

Line (3) shows that the performance of nearest medoids is very poor, 
something which surprised us. It is perhaps due to the small sample sizes 
and high dimensions, with medoids having much higher variance than 
means. The performance of the one-nearest neighbor classifier is also poor. 

The performance of the nearest centroid classifier is also shown in Table 18.3 
in line (5): it is better than nearest medoids, but worse than that 
of nearest shrunken centroids, even with no shrinkage. The difference seems 
to be the standardization of each feature that is done in nearest shrunken 
centroids. This standardization is important here, and requires access to 
the individual feature values. Nearest centroids uses a spherical metric, and 
relies on the fact that the features are in similar units. The support vector 
machine estimates a linear combination of the features and can better deal 
with unstandardized features. 


18.6 High-Dimensional Regression: Supervised 
Principal Components 

In this section we describe a simple approach to regression and generalized 
regression that is especially useful when $p \gg N$. We illustrate the method 
on another microarray data example. The data is taken from Rosenwald 
et al. (2002) and consists of 240 samples from patients with diffuse large 
B-cell lymphoma (DLBCL), with gene expression measurements for 7399 
genes. The outcome is survival time, either observed or right censored. We 
randomly divided the lymphoma samples into a training set of size 160 and 
a test set of size 80. 

Although supervised principal components is useful for linear regression, 
its most interesting applications may be in survival studies, which is the 
focus of this example. 

We have not yet discussed regression with censored survival data in this 
book; it represents a generalized form of regression in which the outcome 
variable (survival time) is only partly observed for some individuals. Suppose, 
for example, we carry out a medical study that lasts for 365 days, and 
for simplicity all subjects are recruited on day one. We might observe one 
individual to die 200 days after the start of the study. Another individual 
might still be alive at 365 days when the study ends. This individual 
is said to be “right censored” at 365 days. We know only that he or she 
lived at least 365 days. Although we do not know how long past 365 days 
the individual actually lived, the censored observation is still informative. 
This is illustrated in Figure 18.11. Figure 18.12 shows the survival curve 
estimated by the Kaplan-Meier method for the 80 patients in the test set. 
See for example Kalbfleisch and Prentice (1980) for a description of the 
Kaplan-Meier method. 
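
For readers who have not met it, the Kaplan-Meier estimate multiplies, over the observed death times, the fraction of subjects at risk who survive each one; censored subjects count as at risk up to their censoring time. The sketch below is a minimal illustration of that product-limit formula; the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : observed time for each subject (death or censoring time)
    events : 1 if the subject died at that time, 0 if right censored
    Returns a list of (t, S(t)) pairs, one per distinct death time.
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, curve = 1.0, []
    for t in death_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve
```

As a hypothetical four-patient illustration (the times are invented for this example), suppose deaths occur at days 100 and 200 and two patients are censored at days 250 and 365. The estimate drops to 3/4 at day 100 and to 3/4 × 2/3 = 1/2 at day 200, and stays there, since the censored patients contribute no further deaths.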

Our objective in this example is to find a set of features (genes) that 
can predict the survival of an independent set of patients. This could be 
useful as a prognostic indicator to aid in choosing treatments, or to help 
understand the biological basis for the disease. 

FIGURE 18.11. Censored survival data. For illustration there are four patients. 
The first and third patients die before the study ends. The second patient is alive 
at the end of the study (365 days), while the fourth patient is lost to follow-up 
before the study ends. For example, this patient might have moved out of the 
country. The survival times for patients two and four are said to be “censored.” 

FIGURE 18.12. Lymphoma data. The Kaplan-Meier estimate of the survival 
function for the 80 patients in the test set, along with one-standard-error curves. 
The curve estimates the probability of surviving past t months. The ticks indicate 
censored observations. 

FIGURE 18.13. Underlying conceptual model for supervised principal components. 
There are two cell types, and patients with the good cell type live longer on 
the average. Supervised principal components estimate the cell type, by averaging 
the expression of genes that reflect it. 

The underlying conceptual model for supervised principal components 
is shown in Figure 18.13. We imagine that there are two cell types, and 
patients with the good cell type live longer on the average. However there 
is considerable overlap in the two sets of survival times. We might think 
of survival time as a “noisy surrogate” for cell type. A fully supervised 
approach would give the most weight to those genes having the strongest 
relationship with survival. These genes are partially, but not perfectly, related 
to cell type. If we could instead discover the underlying cell types of 
the patients, often reflected by a sizable signature of genes acting together 
in pathways, then we might do a better job of predicting patient survival. 

Although the cell type in Figure 18.13 is discrete, it is useful to imagine 
a continuous cell type, defined by some linear combination of the features. 
We will estimate the cell type as a continuous quantity, and then discretize 
it for display and interpretation. 

How can we find the linear combination that defines the important underlying 
cell types? Principal components analysis (Section 14.5) is an effective 
method for finding linear combinations of features that exhibit large variation 
in a dataset. But what we seek here are linear combinations with both 
high variance and significant correlation with the outcome. The lower right 
panel of Figure 18.14 shows the result of applying standard principal components 
in this example; the leading component does not correlate strongly 
with survival (details are given in the figure caption). 

Hence we want to encourage principal component analysis to find linear 
combinations of features that have high correlation with the outcome. To 
do this, we restrict attention to features which by themselves have a sizable 
correlation with the outcome. This is summarized in the supervised 
principal components Algorithm 18.1, and illustrated in Figure 18.14. 

FIGURE 18.14. Supervised principal components on the lymphoma data. The 
left panel shows a heatmap of a subset of the gene-expression training data. The 
rows are ordered by the magnitude of the univariate Cox-score, shown in the middle 
vertical column. The top 50 and bottom 50 genes are shown. The supervised 
principal component uses the top 27 genes (chosen by 10-fold CV). It is represented 
by the bar at the top of the heatmap, and is used to order the columns 
of the expression matrix. In addition, each row is multiplied by the sign of the 
Cox-score. The middle panel on the right shows the survival curves on the test 
data when we create a low and high group by splitting this supervised PC at zero 
(training data mean). The curves are well separated, as indicated by the p-value 
for the log-rank test. The top panel does the same, using the top-scoring gene on 
the training data. The curves are somewhat separated, but not significantly. The 
bottom panel uses the first principal component on all the genes, and the separation 
is also poor. Each of the top genes can be interpreted as a noisy surrogate for 
a latent underlying cell-type characteristic, and supervised principal components 
uses them all to estimate this latent factor. 

Algorithm 18.1 Supervised Principal Components. 

1. Compute the standardized univariate regression coefficients for the 
outcome as a function of each feature separately. 

2. For each value of the threshold $\theta$ from the list $0 \le \theta_1 < \theta_2 < \cdots < \theta_K$: 

(a) Form a reduced data matrix consisting of only those features 
whose univariate coefficient exceeds $\theta$ in absolute value, and 
compute the first m principal components of this matrix. 

(b) Use these principal components in a regression model to predict 
the outcome. 

3. Pick $\theta$ (and m) by cross-validation. 

The details in steps (1) and (2b) will depend on the type of outcome 
variable. For a standard regression problem, we use the univariate linear 
least squares coefficients in step (1) and a linear least squares model in 
step (2b). For survival problems, Cox’s proportional hazards regression 
model is widely used; hence we use the score test from this model in step (1) 
and the multivariate Cox model in step (2b). The details are not essential 
for understanding the basic method; they may be found in Bair et al. (2006). 
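
For a quantitative outcome, steps 1, 2(a) and 2(b) of Algorithm 18.1 can be sketched for a single threshold as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the univariate score is taken to be ⟨x_j, y⟩/‖x_j‖ on centered data, only one component (m = 1) is used, the component is found by simple power iteration, no feature is constant, and the cross-validation of step 3 is left out. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def supervised_pc(X, y, theta, n_iter=200):
    """One pass of supervised principal components for a given threshold.

    X is a list of N rows, each a list of p feature values; y has length N.
    Returns the retained feature indices and the fitted training values.
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    yc = center(y)

    # Step 1: univariate score for each feature (assumed form, see lead-in).
    scores = [dot(c, yc) / math.sqrt(dot(c, c)) for c in cols]

    # Step 2(a): keep features whose score exceeds theta in absolute value.
    keep = [j for j in range(p) if abs(scores[j]) > theta]
    if not keep:
        raise ValueError("no feature passes the threshold")
    kept = [cols[j] for j in keep]

    # First principal component of the reduced matrix, by power iteration
    # on X^T X, started from the scores of the retained features.
    v = [scores[j] for j in keep]
    for _ in range(n_iter):
        z = [sum(vj * c[i] for vj, c in zip(v, kept)) for i in range(n)]
        w = [dot(c, z) for c in kept]
        norm = math.sqrt(dot(w, w))
        v = [wj / norm for wj in w]
    z = [sum(vj * c[i] for vj, c in zip(v, kept)) for i in range(n)]

    # Step 2(b): least squares regression of y on the single component z.
    slope = dot(z, yc) / dot(z, z)
    intercept = sum(y) / n
    fitted = [intercept + slope * zi for zi in z]
    return keep, fitted
```

In practice one would call this for each threshold in the list and compare cross-validated prediction error; for survival outcomes the score and the regression in step 2(b) would be replaced by their Cox-model counterparts, which this sketch does not attempt.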

Figure 18.14 shows the results of supervised principal components in this 
example. We used a Cox-score cutoff of 3.53, yielding 27 genes, where the 
value 3.53 was found through 10-fold cross-validation. We then computed 
the first principal component (m = 1) using just this subset of the data, 
as well as its value for each of the test observations. We included this as 
a quantitative predictor in a Cox regression model, and its likelihood-ratio 
significance was p = 0.005. When dichotomized (using the mean score on 
the training data as a threshold), it clearly separates the patients in the 
test set into low and high risk groups (middle-right panel of Figure 18.14, 
p = 0.006). 

The top-right panel of Figure 18.14 uses the top scoring gene (dichotomized) 
alone as a predictor of survival. It is not significant on the test set. 
Likewise, the lower-right panel shows the dichotomized principal component 
using all the training data, which is also not significant. 

Our procedure allows m > 1 principal components in step (2a). However, 
the supervision in step (1) encourages the principal components to align 
with the outcome, and thus in most cases only the first or first few components 
tend to be useful for prediction. In the mathematical development 
below, we consider only the first component, but extensions to more than 
one component can be derived in a similar way. 

18.6.1 Connection to Latent-Variable Modeling 

A formal connection between supervised principal components and the underlying 
cell type model (Figure 18.13) can be seen through a latent variable 
model for the data. Suppose we have a response variable Y which is related 
to an underlying latent variable U by a linear model 

$$
Y = \beta_0 + \beta_1 U + \varepsilon. \tag{18.32}
$$

In addition, we have measurements on a set of features $X_j$ indexed by $j \in \mathcal{P}$ 
(for pathway), for which 

$$
X_j = \alpha_{0j} + \alpha_{1j} U + \epsilon_j, \quad j \in \mathcal{P}. \tag{18.33}
$$

The errors $\varepsilon$ and $\epsilon_j$ are assumed to have mean zero and are independent of 
all other random variables in their respective models. 

We also have many additional features $X_k$, $k \notin \mathcal{P}$, which are independent 
of U. We would like to identify $\mathcal{P}$, estimate U, and hence fit the prediction 
model (18.32). This is a special case of a latent-structure model, or 
single-component factor-analysis model (Mardia et al., 1979, see also Section 14.7). 
The latent factor U is a continuous version of the cell type 
conceptualized in Figure 18.13. 

The supervised principal component algorithm can be seen as a method 
for fitting this model: 

• The screening step (1) estimates the set $\mathcal{P}$. 

• Given $\mathcal{P}$, the largest principal component in step (2a) estimates the 
latent factor U. 

• Finally, the regression fit in step (2b) estimates the coefficient in 
model (18.32). 

Step (1) is natural, since on average the regression coefficient is nonzero 
only if $\alpha_{1j}$ is nonzero. Hence this step should select the features $j \in \mathcal{P}$. 
Step (2a) is natural if we assume that the errors $\epsilon_j$ have a Gaussian distribution, 
with the same variance. In this case the principal component is 
the maximum likelihood estimate for the single factor model (Mardia et 
al., 1979). The regression in (2b) is an obvious final step. 

Suppose there are a total of p features, with $p_1$ features in the relevant set 
$\mathcal{P}$. Then if p and $p_1$ grow but $p_1$ is small relative to p, one can show (under 
reasonable conditions) that the leading supervised principal component 
is consistent for the underlying latent factor. The usual leading principal 
component may not be consistent, since it can be contaminated by the 
presence of a large number of “noise” features. 

Finally, suppose that the threshold used in step (1) of the supervised 
principal component procedure yields a large number of features for computation 
of the principal component. Then for interpretational purposes, as 
well as for practical uses, we would like some way of finding a reduced set 
of features that approximates the model. Pre-conditioning (Section 18.6.3) 
is one way of doing this. 

18.6.2 Relationship with Partial Least Squares 

Supervised principal components is closely related to partial least squares 
regression (Section 3.5.2). Bair et al. (2006) found that the key to the good 
performance of supervised principal components was the filtering out of 
noisy features in step (2a). Partial least squares (Section 3.5.2) downweights 
noisy features, but does not throw them away; as a result a large number 
of noisy features can contaminate the predictions. However, a modification 
of the partial least squares procedure has been proposed that has a similar 
flavor to supervised principal components [Brown et al. (1991), Nadler and 
Coifman (2005), for example]. We select the features as in steps (1) and 
(2a) of supervised principal components, but then apply PLS (rather than 
principal components) to these features. For our current discussion, we call 
this “thresholded PLS.” 

Thresholded PLS can be viewed as a noisy version of supervised principal 
components, and hence we might not expect it to work as well in practice. 
Assume the variables are all standardized. The first PLS variate has the 
form 

$$
z = \sum_{j \in \mathcal{P}} \langle y, x_j \rangle x_j, \tag{18.34}
$$

and can be thought of as an estimate of the latent factor U in model (18.33). 
In contrast, the supervised principal components direction $\hat{u}$ satisfies 

$$
\hat{u} = \frac{1}{d^2} \sum_{j \in \mathcal{P}} \langle \hat{u}, x_j \rangle x_j, \tag{18.35}
$$

where d is the leading singular value of $X_{\mathcal{P}}$. This follows from the definition 
of the leading principal component. Hence thresholded PLS uses weights 
which are the inner product of y with each of the features, while supervised 
principal components uses the features to derive a “self-consistent” estimate 
$\hat{u}$. Since many features contribute to the estimate $\hat{u}$, rather than just the 
single outcome y, we can expect $\hat{u}$ to be less noisy than z. In fact, if there 
are $p_1$ features in the set $\mathcal{P}$, and N, p and $p_1$ go to infinity with $p_1/N \to 0$, 
then it can be shown using the techniques in Bair et al. (2006) that 

$$
z = u + O_p(1), \qquad \hat{u} = u + O_p\big(\sqrt{p_1/N}\big), \tag{18.36}
$$

where u is the true (unobservable) latent variable in the model (18.32), 
(18.33). 
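
The contrast between (18.34) and (18.35) can be seen in a small sketch. The first function forms the PLS variate with weights ⟨y, x_j⟩; the second repeatedly applies the map u ↦ Σ_j ⟨u, x_j⟩ x_j and rescales to unit length, so that a fixed point satisfies (18.35) with the rescaling playing the role of 1/d². The function names are illustrative assumptions, and the features are passed as a list of columns already restricted to the selected set.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def first_pls_variate(cols, y):
    """z = sum_j <y, x_j> x_j, as in (18.34)."""
    weights = [dot(y, c) for c in cols]
    return [sum(w * c[i] for w, c in zip(weights, cols)) for i in range(len(y))]


def self_consistent_direction(cols, u0, n_iter=100):
    """Iterate u <- sum_j <u, x_j> x_j, rescaled to unit length.

    Started from a vector not orthogonal to the leading principal
    component, the iterates approach a solution of (18.35).
    """
    u = list(u0)
    for _ in range(n_iter):
        weights = [dot(u, c) for c in cols]
        u = [sum(w * c[i] for w, c in zip(weights, cols)) for i in range(len(u))]
        norm = math.sqrt(dot(u, u))
        u = [ui / norm for ui in u]
    return u
```

Starting the iteration from the PLS variate z makes the relationship explicit: z is the result of a single application of the map to y, while the supervised principal component is what remains after the map has been applied until the weights depend only on the features.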

We now present a simulation example to compare the methods numerically. 
There are N = 100 samples and p = 5000 genes. We generated the 
data as follows: 

(18.37) 

where $\epsilon_{ij}$ and $\varepsilon_i$ are independent normal random variables with mean 0 and 
standard deviations 1 and 1.5, respectively. Thus in the first 50 genes, there 
is an average difference of 1 unit between samples 1-50 and 51-100, and this 
difference correlates with the outcome y. The next 200 genes have a large 
average difference of 4 units between samples (1-25, 51-75) and (26-50, 
76-100), but this difference is uncorrelated with the outcome. The rest of 
the genes are noise. Figure 18.15 shows a heatmap of a typical realization, 
with the outcome at the left, and the first 500 genes to the right. 

FIGURE 18.15. Heatmap of the outcome (left column) and first 500 genes from 
a realization from model (18.37). The genes are in the columns, and the samples 
are in the rows. 

We generated 100 simulations from this model, and summarize the test 
error results in Figure 18.16. The test errors of principal components and 
partial least squares are shown at the right of the plot; both are badly 
affected by the noisy features in the data. Supervised principal components 
and thresholded PLS work best over a wide range of the number of selected 
features, with the former showing consistently lower test errors. 

While this example seems “tailor-made” for supervised principal components, 
its good performance seems to hold in other simulated and real 
datasets (Bair et al., 2006). 


18.6.3 Pre-Conditioning for Feature Selection 

Supervised principal components can yield lower test errors than competing 
methods, as shown in Figure 18.16. However, it does not always produce a 
sparse model involving only a small number of features (genes). Even if the 
thresholding in Step (1) of the algorithm yields a relatively small number 
of features, it may be that some of the omitted features have sizable inner 
products with the supervised principal component (and could act as a good 
surrogate). In addition, highly correlated features will tend to be chosen 
together, and there may be a great deal of redundancy in the set of selected 
features. 

FIGURE 18.16. Root mean squared test error (± one standard error), for 
supervised principal components and thresholded PLS on 100 realizations from 
model (18.37). All methods use one component, and the errors are relative to 
the noise standard deviation (the Bayes error is 1.0). For both methods, different 
values for the filtering threshold were tried and the number of features retained 
is shown on the horizontal axis. The extreme right points correspond to regular 
principal components and partial least squares, using all the genes. 

The lasso (Sections 18.4 and 3.4.2), on the other hand, produces a sparse 
model from the data. How do the test errors of the two methods compare on 
the simulated example of the last section? Figure 18.17 shows the test errors 
for one realization from model (18.37) for the lasso, supervised principal 
components, and the pre-conditioned lasso (described below). 

We see that supervised principal components (orange curve) reaches its 
lowest error when about 50 features are included in the model, which is 
the correct number for the simulation. Although a linear model in the first 
50 features is optimal, the lasso (green) is adversely affected by the large 
number of noisy features, and starts overfitting when far fewer are in the 
model. 

Can we get the low test error of supervised principal components along
with the sparsity of the lasso? This is the goal of pre-conditioning (Paul
et al., 2008). In this approach, one first computes the supervised principal
component predictor $\hat{y}_i$ for each observation in the training set (with
the threshold selected by cross-validation). Then we apply the lasso with
$\hat{y}_i$ as the outcome variable, in place of the usual outcome $y_i$. All
features are used in the lasso fit, not just those that were retained in the
thresholding step in supervised principal components. The idea is that by
first denoising the outcome variable, the lasso should not be as adversely
affected by the large number of noise features. Figure 18.17 shows that
pre-conditioning (purple curve) has been successful here, yielding much lower
test error than the usual lasso, and as low (in this case) as for supervised
principal components. It can also achieve this using fewer features. The
usual lasso, applied to the raw outcome, starts to overfit more quickly than
the pre-conditioned version. For the pre-conditioned lasso, overfitting is
not a problem, since the outcome variable has been denoised. We usually
select the tuning parameter for the pre-conditioned lasso on more subjective
grounds, like parsimony.

FIGURE 18.17. Test errors for the lasso, supervised principal components,
and pre-conditioned lasso, for one realization from model (18.37). Each model is
indexed by the number of non-zero features. The supervised principal component
path is truncated at 250 features. The lasso self-truncates at 100, the sample size
(see Section 18.4). In this case, the pre-conditioned lasso achieves the lowest error
with about 25 features.

Pre-conditioning can be applied in a variety of settings, using initial 
estimates other than supervised principal components and post-processors 
other than the lasso. More details may be found in Paul et al. (2008). 


18.7 Feature Assessment and the Multiple-Testing 
Problem 

In the first part of this chapter we discuss prediction models in the $p \gg N$
setting. Here we consider the more basic problem of assessing the
significance of each of the $p$ features. Consider the protein mass spectrometry
example of Section 18.4.1. In that problem, the scientist might not be
interested in predicting whether a given patient has prostate cancer. Rather the
goal might be to identify proteins whose abundance differs between
normal and cancer samples, in order to enhance understanding of the disease
and suggest targets for drug development. Thus our goal is to assess the
significance of individual features. This assessment is usually done without
the use of a multivariate predictive model like those in the first part of this
chapter. The feature assessment problem moves our focus from prediction
to the traditional statistical topic of multiple hypothesis testing. For the
remainder of this chapter we will use $M$ instead of $p$ to denote the number
of features, since we will frequently be referring to p-values.


TABLE 18.4. Subset of the 12,625 genes from microarray study of radiation
sensitivity. There are a total of 44 samples in the normal group and 14 in the
radiation sensitive group; we only show three samples from each group.

                        Normal                     Radiation Sensitive
Gene 1          7.85    29.74    29.50  ...     17.20   -50.75   -18.89  ...
Gene 2         15.44     2.70    19.37  ...      6.57    -7.41    79.18  ...
Gene 3         -1.79    15.52    -3.13  ...     -8.32    12.64     4.75  ...
Gene 4        -11.74    22.35   -36.11  ...    -52.17     7.24    -2.32  ...
...
Gene 12,625   -14.09    32.77    57.78  ...    -32.84    24.09  -101.44  ...


Consider, for example, the microarray data in Table 18.4, taken from a 
study on the sensitivity of cancer patients to ionizing radiation treatment 
(Rieger et al., 2004). Each row consists of the expression of genes in 58 
patient samples: 44 samples were from patients with a normal reaction, and 
14 from patients who had a severe reaction to radiation. The measurements 
were made on oligo-nucleotide microarrays. The object of the experiment 
was to find genes whose expression was different in the radiation-sensitive
group of patients. There are M = 12,625 genes altogether; the table shows
the data for some of the genes and samples for illustration.

To identify informative genes, we construct a two-sample t-statistic for
each gene,

$$t_j = \frac{\bar{x}_{2j} - \bar{x}_{1j}}{\mathrm{se}_j}, \tag{18.38}$$

where $\bar{x}_{\ell j} = \sum_{i \in C_\ell} x_{ij}/N_\ell$. Here $C_\ell$ are
the indices of the $N_\ell$ samples in group $\ell$, where $\ell = 1$ is the
normal group and $\ell = 2$ is the sensitive group. The quantity
$\mathrm{se}_j$ is the pooled within-group standard error for gene $j$:

$$\mathrm{se}_j = \hat{\sigma}_j \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}; \qquad
\hat{\sigma}_j^2 = \frac{1}{N_1 + N_2 - 2} \left( \sum_{i \in C_1} (x_{ij} - \bar{x}_{1j})^2
+ \sum_{i \in C_2} (x_{ij} - \bar{x}_{2j})^2 \right). \tag{18.39}$$

FIGURE 18.18. Radiation sensitivity microarray example. A histogram of the
12,625 t-statistics comparing the radiation-sensitive versus insensitive groups.
Overlaid in blue is the histogram of the t-statistics from 1000 permutations of the
sample labels.
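The computation in (18.38) and (18.39) can be sketched in a few lines of Python. This is a minimal illustration, not code from the study: the function name `two_sample_t` and the toy expression values are assumptions made here for the example.

```python
import math

def two_sample_t(group1, group2):
    """Pooled-variance two-sample t-statistic, following (18.38)-(18.39).

    group1, group2: expression values of one gene in group 1 (normal)
    and group 2 (radiation sensitive).
    """
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Pooled within-group variance with N1 + N2 - 2 degrees of freedom.
    ss = (sum((x - mean1) ** 2 for x in group1)
          + sum((x - mean2) ** 2 for x in group2))
    sigma2 = ss / (n1 + n2 - 2)
    se = math.sqrt(sigma2 * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / se

# Toy values invented for illustration (not taken from Table 18.4).
normal = [1.0, 2.0, 3.0]
sensitive = [2.0, 4.0, 6.0]
t = two_sample_t(normal, sensitive)
print(round(t, 4))
```

Applying the function to every row of the expression matrix gives one statistic per gene, which is the input to everything that follows in this section.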

A histogram of the 12,625 t-statistics is shown in orange in Figure 18.18,
ranging in value from $-4.7$ to $5.0$. If the $t_j$ values were normally distributed
we could consider any value greater than two in absolute value to be
significantly large. This would correspond to a significance level of about 5%.
Here there are 1189 genes with $|t_j| > 2$. However with 12,625 genes we
would expect many large values to occur by chance, even if the
grouping is unrelated to any gene. For example, if the genes were independent
(which they are surely not), the number of falsely significant genes would
have a binomial distribution with mean $12{,}625 \cdot 0.05 = 631.3$ and standard
deviation 24.5; the actual 1189 is way out of range.
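As a worked check of these figures, assuming independent tests each at level 0.05, the binomial mean and standard deviation are as follows; the final standardized distance is an added illustration of how far out of range 1189 is.

```latex
\begin{align*}
\text{mean} &= 12{,}625 \times 0.05 = 631.25, \\
\text{sd}   &= \sqrt{12{,}625 \times 0.05 \times 0.95} = \sqrt{599.6875} \approx 24.49, \\
\frac{1189 - 631.25}{24.49} &\approx 22.8 \ \text{standard deviations above the mean.}
\end{align*}
```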

How do we assess the results for all 12,625 genes? This is called the
multiple testing problem. We can start as above by computing a p-value for
each gene. This can be done using the theoretical t-distribution
probabilities, which assumes the features are normally distributed. An attractive
alternative approach is to use the permutation distribution, since it avoids
assumptions about the distribution of the data. We compute (in principle)
all $K = \binom{58}{14}$ permutations of the sample labels, and for each
permutation $k$ compute the t-statistics $t_j^k$. Then the p-value for gene $j$ is

$$p_j = \frac{1}{K} \sum_{k=1}^{K} I(|t_j^k| > |t_j|). \tag{18.40}$$

Of course, $\binom{58}{14}$ is a large number (around $10^{13}$) and so we
can’t enumerate all of the possible permutations. Instead we take a random
sample of the possible permutations; here we took a random sample of
$K = 1000$ permutations.

To exploit the fact that the genes are similar (e.g., measured on the
same scale), we can instead pool the results for all genes in computing the
p-values:

$$p_j = \frac{1}{MK} \sum_{j'=1}^{M} \sum_{k=1}^{K} I(|t_{j'}^k| > |t_j|). \tag{18.41}$$

This also gives more granular p-values than does (18.40), since there are many
more values in the pooled null distribution than there are in each individual
null distribution.
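The two permutation p-values, per-gene as in (18.40) and pooled as in (18.41), can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the synthetic data (20 genes, 8 + 4 samples, one strongly shifted gene) and the choice of K = 200 random permutations are all invented for the example.

```python
import math
import random

def t_stat(x, labels):
    """Two-sample t-statistic (18.38) for one gene; labels are 1 or 2."""
    g1 = [v for v, l in zip(x, labels) if l == 1]
    g2 = [v for v, l in zip(x, labels) if l == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    se = math.sqrt(ss / (len(x) - 2) * (1 / len(g1) + 1 / len(g2)))
    return (m2 - m1) / se

def permutation_pvalues(X, labels, K=200, seed=0):
    """Return (per-gene p-values (18.40), pooled p-values (18.41)).

    X is a list of M rows (genes), each a list of N expression values.
    """
    rng = random.Random(seed)
    M = len(X)
    t_obs = [t_stat(row, labels) for row in X]
    perm_labels = list(labels)
    t_perm = []  # t_perm[k][j]: statistic of gene j under permutation k
    for _ in range(K):
        rng.shuffle(perm_labels)  # random relabelling of the samples
        t_perm.append([t_stat(row, perm_labels) for row in X])
    per_gene = [sum(abs(t_perm[k][j]) > abs(t_obs[j]) for k in range(K)) / K
                for j in range(M)]
    # Pool the null statistics of all genes and all permutations.
    pooled_null = [abs(t) for perm in t_perm for t in perm]
    pooled = [sum(v > abs(t_obs[j]) for v in pooled_null) / (M * K)
              for j in range(M)]
    return per_gene, pooled

# Synthetic toy data, invented for illustration.
rng = random.Random(1)
labels = [1] * 8 + [2] * 4
X = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(20)]
# Shift gene 0 strongly in group 2 so that it is truly differential.
X[0] = [v + (10.0 if l == 2 else 0.0) for v, l in zip(X[0], labels)]
per_gene, pooled = permutation_pvalues(X, labels)
```

The per-gene p-values can only take the values 0, 1/K, 2/K, ..., whereas the pooled ones move in steps of 1/(MK), which is the granularity advantage noted in the text.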

Using this set of p-values, we would like to test the hypotheses:

    $H_{0j}$ = treatment has no effect on gene $j$
                         versus                                 (18.42)
    $H_{1j}$ = treatment has an effect on gene $j$

for all $j = 1, 2, \ldots, M$. We reject $H_{0j}$ at level $\alpha$ if
$p_j < \alpha$. This test has type-I error equal to $\alpha$; that is, the
probability of falsely rejecting $H_{0j}$ is $\alpha$.

Now with many tests to consider, it is not clear what we should use
as an overall measure of error. Let $A_j$ be the event that $H_{0j}$ is falsely
rejected; by definition $\Pr(A_j) = \alpha$. The family-wise error rate (FWER)
is the probability of at least one false rejection, and is a commonly used
overall measure of error. In detail, if $A = \cup_{j=1}^{M} A_j$ is the event of
at least one false rejection, then the FWER is $\Pr(A)$. Generally
$\Pr(A) \gg \alpha$ for large $M$, and depends on the correlation between the
tests. If the tests are independent each with type-I error rate $\alpha$, then
the family-wise error rate of the collection of tests is $1 - (1 - \alpha)^M$.
On the other hand, if the tests have positive dependence, that is
$\Pr(A_j \mid A_k) > \Pr(A_j)$, then the FWER will be less than
$1 - (1 - \alpha)^M$. Positive dependence between tests often
occurs in practice, in particular in genomic studies.

One of the simplest approaches to multiple testing is the Bonferroni
method. It makes each individual test more stringent, in order to make the
FWER equal to at most $\alpha$: we reject $H_{0j}$ if $p_j < \alpha/M$. It is
easy to show that the resulting FWER is $\le \alpha$ (Exercise 18.16). The
Bonferroni method can be useful if $M$ is relatively small, but for large $M$
it is too conservative, that is, it calls too few genes significant.

In our example, if we test at level say $\alpha = 0.05$, then we must use the
threshold $0.05/12{,}625 \approx 3.96 \times 10^{-6}$. None of the 12,625 genes
had a p-value this small.

There are variations to this approach that adjust the individual p-values
to achieve an FWER of at most $\alpha$, with some approaches avoiding the
assumption of independence; see, e.g., Dudoit et al. (2002b).
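The family-wise error rate under independence and the Bonferroni threshold for this example can be checked numerically. The sketch below assumes independent tests, which the text notes is unrealistic for genes; the function name `fwer_independent` is introduced here only for illustration.

```python
def fwer_independent(alpha, M):
    """FWER of M independent tests, each at level alpha: 1 - (1 - alpha)^M."""
    return 1.0 - (1.0 - alpha) ** M

M = 12625
alpha = 0.05

# Per-test threshold used by the Bonferroni method.
bonferroni_threshold = alpha / M

# Without correction, at least one false rejection is essentially certain.
fwer_uncorrected = fwer_independent(alpha, M)

# With the Bonferroni threshold the FWER stays just below alpha.
fwer_bonferroni = fwer_independent(bonferroni_threshold, M)

print(bonferroni_threshold, fwer_uncorrected, fwer_bonferroni)
```

Under independence the Bonferroni FWER is close to $1 - e^{-\alpha}$, slightly below $\alpha$, so the method is nearly exact in that case and becomes conservative when the tests are positively dependent.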


18.7.1 The False Discovery Rate 

A different approach to multiple testing does not try to control the FWER, 
but focuses instead on the proportion of falsely significant genes. As we will 
see, this approach has a strong practical appeal. 

Table 18.5 summarizes the theoretical outcomes of $M$ hypothesis tests.
Note that the family-wise error rate is $\Pr(V \ge 1)$.

TABLE 18.5. Possible outcomes from $M$ hypothesis tests. Note that $V$ is the
number of false-positive tests; the type-I error rate is $E(V)/M_0$. The type-II
error rate is $E(T)/M_1$, and the power is $1 - E(T)/M_1$.

                Called             Called
                Not Significant    Significant    Total
$H_0$ True      $U$                $V$            $M_0$
$H_0$ False     $T$                $S$            $M_1$
Total           $M - R$            $R$            $M$

Here we instead focus on the false discovery rate

$$\mathrm{FDR} = E(V/R). \tag{18.43}$$

In the microarray setting, this is the expected proportion of genes that
are incorrectly called significant, among the $R$ genes that are called
significant. The expectation is taken over the population from which the data
are generated. Benjamini and Hochberg (1995) first proposed the notion of
false discovery rate, and gave a testing procedure (Algorithm 18.2) whose
FDR is bounded by a user-defined level $\alpha$. The Benjamini-Hochberg (BH)
procedure is based on p-values; these can be obtained from an asymptotic
approximation to the test statistic (e.g., Gaussian), or a permutation
distribution, as is done here.

If the hypotheses are independent, Benjamini and Hochberg (1995) show
that regardless of how many null hypotheses are true and regardless of the
distribution of the p-values when the null hypothesis is false, this procedure
has the property

$$\mathrm{FDR} \le \frac{M_0}{M}\,\alpha \le \alpha. \tag{18.45}$$

For illustration we chose $\alpha = 0.15$. Figure 18.19 shows a plot of the
ordered p-values $p_{(j)}$, and the line with slope $0.15/12{,}625$.

Algorithm 18.2 Benjamini-Hochberg (BH) Method.

1. Fix the false discovery rate $\alpha$ and let
   $p_{(1)} \le p_{(2)} \le \cdots \le p_{(M)}$ denote the ordered p-values.

2. Define

   $$L = \max\left\{ j : p_{(j)} < \alpha \cdot \frac{j}{M} \right\}. \tag{18.44}$$

3. Reject all hypotheses $H_{0j}$ for which $p_j \le p_{(L)}$, the BH rejection
   threshold.



FIGURE 18.19. Microarray example continued. Shown is a plot of the ordered
p-values $p_{(j)}$ and the line $0.15 \cdot (j/12{,}625)$, for the
Benjamini-Hochberg method. The horizontal axis shows the genes ordered by
p-value. The largest $j$ for which the p-value $p_{(j)}$ falls below the line
gives the BH threshold. Here this occurs at $j = 11$, indicated by the vertical
line. Thus the BH method calls significant the 11 genes (in red) with smallest
p-values.

Algorithm 18.3 The Plug-in Estimate of the False Discovery Rate.

1. Create $K$ permutations of the data, producing t-statistics $t_j^k$ for
   features $j = 1, 2, \ldots, M$ and permutations $k = 1, 2, \ldots, K$.

2. For a range of values of the cut-point $C$, let

   $$R_{\mathrm{obs}} = \sum_{j=1}^{M} I(|t_j| > C), \qquad
   \widehat{E(V)} = \frac{1}{K} \sum_{j=1}^{M} \sum_{k=1}^{K} I(|t_j^k| > C). \tag{18.46}$$

3. Estimate the FDR by $\widehat{\mathrm{FDR}} = \widehat{E(V)}/R_{\mathrm{obs}}$.


Starting at the left and moving right, the BH method finds the last time
that the p-values fall below the line. This occurs at $j = 11$, so we reject
the 11 genes with smallest p-values. Note that the cutoff occurs at the 11th
smallest p-value, 0.00012, and the 11th largest of the values $|t_j|$ is 4.101.
Thus we reject the 11 genes with $|t_j| \ge 4.101$.
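The steps of Algorithm 18.2 translate directly into code. The following is a minimal sketch; the function name `benjamini_hochberg` and the ten toy p-values are assumptions made for illustration and are unrelated to the microarray data.

```python
def benjamini_hochberg(pvalues, alpha):
    """Indices of the hypotheses rejected by the BH method (Algorithm 18.2)."""
    M = len(pvalues)
    order = sorted(range(M), key=lambda j: pvalues[j])
    # L = largest rank j (1-based) with p_(j) < alpha * j / M; 0 if none.
    L = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] < alpha * rank / M:
            L = rank
    if L == 0:
        return []
    threshold = pvalues[order[L - 1]]  # the BH rejection threshold p_(L)
    return sorted(j for j in range(M) if pvalues[j] <= threshold)

# Toy p-values invented for illustration.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected_15 = benjamini_hochberg(pvals, alpha=0.15)
rejected_05 = benjamini_hochberg(pvals, alpha=0.05)
print(rejected_15, rejected_05)
```

Note that the loop keeps the last rank at which the ordered p-value falls below the line, not the first rank at which it rises above it; this matches the "last time" wording in the text.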

From our brief description, it is not clear how the BH procedure works; 
that is, why the corresponding FDR is at most 0.15, the value used for a. 
Indeed, the proof of this fact is quite complicated (Benjamini and Hochberg, 
1995). 

A more direct way to proceed is a plug-in approach. Rather than starting
with a value for $\alpha$, we fix a cut-point for our t-statistics, say the value
4.101 that appeared above. The number of observed values $|t_j|$ equal to or
greater than 4.101 is 11. The total number of permutation values $|t_j^k|$ equal
to or greater than 4.101 is 1518, for an average of $1518/1000 = 1.518$ per
permutation. Thus a direct estimate of the false discovery rate is
$\widehat{\mathrm{FDR}} = 1.518/11 \approx 14\%$. Note that 14% is approximately
equal to the value of $\alpha = 0.15$ used above (the difference is due to
discreteness). This procedure is summarized in Algorithm 18.3. To recap:

The plug-in estimate of FDR of Algorithm 18.3 is equivalent to the BH
procedure of Algorithm 18.2, using the permutation p-values (18.40).

This correspondence between the BH method and the plug-in estimate is 
not a coincidence. Exercise 18.17 shows that they are equivalent in general. 
Note that this procedure makes no reference to p-values at all, but rather 
works directly with the test statistics. 
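The plug-in estimate of Algorithm 18.3 needs only the observed statistics, the permutation statistics and a cut-point. A minimal sketch follows; the function name `plugin_fdr` and the small toy inputs are assumptions for illustration, while the last two lines repeat the arithmetic quoted in the text for the cut-point 4.101.

```python
def plugin_fdr(t_obs, t_perm, cut):
    """Plug-in FDR estimate (Algorithm 18.3) at cut-point `cut`.

    t_obs:  list of M observed statistics.
    t_perm: list of K lists, each holding M permutation statistics.
    Returns None when no observed statistic exceeds the cut-point.
    """
    K = len(t_perm)
    r_obs = sum(abs(t) > cut for t in t_obs)
    # Average number of permutation statistics beyond the cut-point.
    e_v = sum(abs(t) > cut for perm in t_perm for t in perm) / K
    if r_obs == 0:
        return None
    return e_v / r_obs

# Toy inputs invented for illustration: M = 5 genes, K = 2 permutations.
t_obs = [5.0, 4.5, 0.3, -4.2, 1.0]
t_perm = [[0.1, -4.3, 0.2, 1.1, 0.0],
          [0.5, 0.4, -0.3, 2.0, 0.9]]
fdr_toy = plugin_fdr(t_obs, t_perm, cut=4.1)

# Arithmetic from the text: 1518 exceedances over 1000 permutations, 11 genes.
fdr_text = (1518 / 1000) / 11
print(fdr_toy, round(fdr_text, 3))
```

Evaluating `plugin_fdr` over a grid of cut-points gives the range of estimates called for in step 2 of the algorithm.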

The plug-in estimate is based on the approximation

$$E(V/R) \approx \frac{E(V)}{E(R)}, \tag{18.47}$$

and in general $\widehat{\mathrm{FDR}}$ is a consistent estimate of FDR (Storey,
2002; Storey et al., 2004). Note that the numerator $\widehat{E(V)}$ actually
estimates $(M/M_0)E(V)$, since the permutation distribution uses $M$ rather
than $M_0$ null hypotheses. Hence if an estimate of $M_0$ is available, a better
estimate of FDR can be obtained from
$(\hat{M}_0/M) \cdot \widehat{\mathrm{FDR}}$. Exercise 18.19 shows a way to
estimate $M_0$. The most conservative (upwardly biased) estimate of FDR uses
$M_0 = M$. Equivalently, an estimate of $M_0$ can be used to improve the BH
method, through relation (18.45).

The reader might be surprised that we chose a value as large as 0.15 for 
a, the FDR bound. We must remember that the FDR is not the same as 
type-I error, for which 0.05 is the customary choice. For the scientist, the 
false discovery rate is the expected proportion of false positive genes among 
the list of genes that the statistician tells him are significant. Microarray 
experiments with FDRs as high as 0.15 might still be useful, especially if 
they are exploratory in nature. 


18.7.2 Asymmetric Cutpoints and the SAM Procedure

In the testing methods described above, we used the absolute value of the
test statistic $t_j$, and hence applied the same cut-points to both positive and
negative values of the statistic. In some experiments, it might happen that
most or all of the differentially expressed genes change in the positive
direction (or all in the negative direction). For this situation it is
advantageous to derive separate cut-points for the two cases.

The significance analysis of microarrays (SAM) approach offers a way of
doing this. The basis of the SAM method is shown in Figure 18.20. On the
vertical axis we have plotted the ordered test statistics
$t_{(1)} \le t_{(2)} \le \cdots \le t_{(M)}$, while the horizontal axis shows
the expected order statistics from the permutations of the data:
$\tilde{t}_{(j)} = (1/K) \sum_{k=1}^{K} t^k_{(j)}$, where
$t^k_{(1)} \le t^k_{(2)} \le \cdots \le t^k_{(M)}$ are the ordered test
statistics from permutation $k$.

Two lines are drawn, parallel to the 45° line, $\Delta$ units away. Starting at
the origin and moving to the right, we find the first place that the genes
leave the band. This defines the upper cutpoint $C_{\mathrm{hi}}$ and all genes
beyond that point are called significant (marked red). Similarly we find the
lower cutpoint $C_{\mathrm{low}}$ for genes in the bottom left corner. Thus each
value of the tuning parameter $\Delta$ defines upper and lower cutpoints, and
the plug-in estimate $\widehat{\mathrm{FDR}}$ for each of these cutpoints is
estimated as before. Typically a range of values of $\Delta$ and associated
$\widehat{\mathrm{FDR}}$ values are computed, from which a particular pair is
chosen on subjective grounds.

The advantage of the SAM approach lies in the possible asymmetry of
the cutpoints. In the example of Figure 18.20, with $\Delta = 0.71$ we obtain
11 significant genes; they are all in the upper right. The data points in the
bottom left never leave the band, and hence $C_{\mathrm{low}} = -\infty$. Hence
for this value of $\Delta$, no genes are called significant on the left
(negative) side. We do not impose symmetry on the cutpoints, as was done in
Section 18.7.1, as there is no reason to assume similar behavior at the two
ends.

FIGURE 18.20. SAM plot for the radiation sensitivity microarray data. On the
vertical axis we have plotted the ordered test statistics, while the horizontal axis
shows the expected order statistics of the test statistics from permutations of the
data. Two lines are drawn, parallel to the 45° line, $\Delta$ units away from it.
Starting at the origin and moving to the right, we find the first place that the
genes leave the band. This defines the upper cut-point $C_{\mathrm{hi}}$ and all
genes beyond that point are called significant (marked in red). Similarly we
define a lower cut-point $C_{\mathrm{low}}$. For the particular value of
$\Delta = 0.71$ in the plot, no genes are called significant in the bottom left.

There is some similarity between this approach and the asymmetry
possible with likelihood-ratio tests. Suppose we have a log-likelihood
$\ell_0(t_j)$ under the null hypothesis of no effect, and a log-likelihood
$\ell(t_j)$ under the alternative. Then a likelihood ratio test amounts to
rejecting the null hypothesis if

$$\ell(t_j) - \ell_0(t_j) > \Delta, \tag{18.48}$$

for some $\Delta$. Depending on the likelihoods, and particularly their relative
values, this can result in a different threshold for $t_j$ than for $-t_j$. The
SAM procedure rejects the null hypothesis if

$$|t_{(j)} - \tilde{t}_{(j)}| > \Delta. \tag{18.49}$$

Again, the threshold for each $t_{(j)}$ depends on the corresponding value of
the null value $\tilde{t}_{(j)}$.
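The cutpoint search described for the SAM plot can be sketched as follows. This is one reading of the description above rather than the SAM software itself: the function name `sam_cutpoints`, the handling of the origin through the sign of the expected order statistic, and the toy inputs are all assumptions made for illustration.

```python
import math

def sam_cutpoints(t_obs, t_perm, delta):
    """Return (c_low, c_hi) for a band of half-width delta around the 45-degree line.

    t_obs:  list of M observed statistics.
    t_perm: list of K lists of M permutation statistics.
    """
    M, K = len(t_obs), len(t_perm)
    t_sorted = sorted(t_obs)
    sorted_perms = [sorted(p) for p in t_perm]
    # Expected order statistics: average the j-th smallest value over permutations.
    expected = [sum(p[j] for p in sorted_perms) / K for j in range(M)]
    c_hi, c_low = math.inf, -math.inf
    # Move right from the origin until a gene leaves the band from above.
    for j in range(M):
        if expected[j] >= 0 and t_sorted[j] - expected[j] > delta:
            c_hi = t_sorted[j]
            break
    # Move left from the origin until a gene leaves the band from below.
    for j in reversed(range(M)):
        if expected[j] < 0 and expected[j] - t_sorted[j] > delta:
            c_low = t_sorted[j]
            break
    return c_low, c_hi

# Toy inputs invented for illustration: M = 6 genes, K = 2 permutations.
t_obs = [-1.2, -0.4, 0.1, 0.6, 3.0, 4.0]
t_perm = [[-1.0, -0.5, 0.0, 0.5, 1.0, 1.5],
          [-1.4, -0.3, 0.2, 0.7, 1.0, 1.3]]
c_low, c_hi = sam_cutpoints(t_obs, t_perm, delta=0.71)
significant = [t for t in t_obs if t >= c_hi or t <= c_low]
print(c_low, c_hi, significant)
```

In this toy example only the upper tail leaves the band, so the lower cutpoint stays at minus infinity, mirroring the asymmetric outcome reported for Figure 18.20.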

18.7.3 A Bayesian Interpretation of the FDR 

There is an interesting Bayesian view of the FDR, developed in Storey
(2002) and Efron and Tibshirani (2002). First we need to define the positive
false discovery rate (pFDR) as

$$\mathrm{pFDR} = E\left[ \frac{V}{R} \,\Big|\, R > 0 \right]. \tag{18.50}$$

The additional term positive refers to the fact that we are only interested
in estimating an error rate where positive findings have occurred. It is
this slightly modified version of the FDR that has a clean Bayesian
interpretation. Note that the usual FDR [expression (18.43)] is not defined if
$\Pr(R = 0) > 0$.

Let $\Gamma$ be a rejection region for a single test; in the example above we
used $\Gamma = (-\infty, -4.10) \cup (4.10, \infty)$. Suppose that $M$ identical
simple hypothesis tests are performed with the i.i.d. statistics
$t_1, \ldots, t_M$ and rejection region $\Gamma$. We define a random variable
$Z_j$ which equals 0 if the $j$th null hypothesis is true, and 1 otherwise. We
assume that each pair $(t_j, Z_j)$ are i.i.d. random variables with

$$t_j \mid Z_j \sim (1 - Z_j) \cdot F_0 + Z_j \cdot F_1 \tag{18.51}$$

for some distributions $F_0$ and $F_1$. This says that each test statistic $t_j$
comes from one of two distributions: $F_0$ if the null hypothesis is true, and
$F_1$ otherwise. Letting $\Pr(Z_j = 0) = \pi_0$, marginally we have:

$$t_j \sim \pi_0 \cdot F_0 + (1 - \pi_0) \cdot F_1. \tag{18.52}$$

Then it can be shown (Efron et al., 2001; Storey, 2002) that

$$\mathrm{pFDR}(\Gamma) = \Pr(Z_j = 0 \mid t_j \in \Gamma). \tag{18.53}$$

Hence under the mixture model (18.51), the pFDR is the posterior
probability that the null hypothesis is true, given that the test statistic
falls in the rejection region for the test; that is, given that we reject the
null hypothesis (Exercise 18.20).
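Result (18.53) can be evaluated with Bayes' rule once the mixture (18.52) is specified. The sketch below assumes, purely for illustration, Gaussian components $F_0 = N(0, 1)$ and $F_1 = N(3, 1)$ with $\pi_0 = 0.9$ and a symmetric rejection region; none of these values come from the text.

```python
from statistics import NormalDist

def region_prob(dist, c):
    """Pr(|t| > c) under `dist`, for the region (-inf, -c) U (c, inf)."""
    return dist.cdf(-c) + (1.0 - dist.cdf(c))

def pfdr(pi0, f0, f1, c):
    """Posterior probability Pr(Z = 0 | t in region), as in (18.53)."""
    null_mass = pi0 * region_prob(f0, c)
    alt_mass = (1.0 - pi0) * region_prob(f1, c)
    return null_mass / (null_mass + alt_mass)

# Assumed mixture components (hypothetical values for illustration).
f0 = NormalDist(0.0, 1.0)  # null distribution F0
f1 = NormalDist(3.0, 1.0)  # alternative distribution F1
pfdr_2 = pfdr(0.9, f0, f1, c=2.0)
pfdr_3 = pfdr(0.9, f0, f1, c=3.0)
print(round(pfdr_2, 3), round(pfdr_3, 3))
```

With 90% of the hypotheses null, a region as wide as $|t| > 2$ still has roughly one null in three among its rejections under these assumptions, while moving the cut-point to 3 lowers the pFDR sharply.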

The false discovery rate provides a measure of accuracy for tests based
on an entire rejection region, such as $|t_j| \ge 2$. But if the FDR of such a
test is say 10%, then a gene with say $t_j = 5$ will be more significant than a
gene with $t_j = 2$. Thus it is of interest to derive a local (gene-specific)
version of the FDR. The q-value (Storey, 2003) of a test statistic $t_j$ is
defined to be the smallest FDR over all rejection regions that reject $t_j$.
That is, for symmetric rejection regions, the q-value for $t_j = 2$ is defined
to be the FDR for the rejection region
$\Gamma = (-\infty, -2) \cup (2, \infty)$. Thus the q-value for $t_j = 5$ will
be smaller than that for $t_j = 2$, reflecting the fact that $t_j = 5$ is more
significant than $t_j = 2$. The local false discovery rate (Efron and
Tibshirani, 2002) at $t = t_0$ is defined to be

$$\Pr(Z_j = 0 \mid t_j = t_0). \tag{18.54}$$

This is the (positive) FDR for an infinitesimal rejection region surrounding
the value $t_j = t_0$.
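Under the two-component mixture, the local false discovery rate (18.54) is a ratio of densities rather than of tail probabilities. The sketch below reuses the same kind of assumed Gaussian mixture ($F_0 = N(0,1)$, $F_1 = N(3,1)$, $\pi_0 = 0.9$); these values and the function name `local_fdr` are hypothetical choices for illustration.

```python
from statistics import NormalDist

def local_fdr(t0, pi0, f0, f1):
    """Pr(Z = 0 | t = t0) under the mixture pi0 * F0 + (1 - pi0) * F1."""
    null_density = pi0 * f0.pdf(t0)
    alt_density = (1.0 - pi0) * f1.pdf(t0)
    return null_density / (null_density + alt_density)

# Assumed mixture components (hypothetical values for illustration).
f0 = NormalDist(0.0, 1.0)
f1 = NormalDist(3.0, 1.0)
lfdr_at_2 = local_fdr(2.0, 0.9, f0, f1)
lfdr_at_5 = local_fdr(5.0, 0.9, f0, f1)
print(round(lfdr_at_2, 3), lfdr_at_5)
```

Under these assumptions a statistic of 2 is still more likely null than not, whereas a statistic of 5 is almost certainly non-null, which is the gene-specific distinction that a single region-wide FDR cannot express.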


18.8 Bibliographic Notes 


Many references were given at specific points in this chapter; we give some
additional ones here. Dudoit et al. (2002a) give an overview and
comparison of discrimination methods for gene expression data. Levina (2002)
does some mathematical analysis comparing diagonal LDA to full LDA, as
$p, N \to \infty$ with $p > N$. She shows that with reasonable assumptions
diagonal LDA has a lower asymptotic error rate than full LDA. Tibshirani et al.
(2001a) and Tibshirani et al. (2003) proposed the nearest shrunken-centroid
classifier. Zhu and Hastie (2004) study regularized logistic regression.
High-dimensional regression and the lasso are very active areas of research, and
many references are given in Section 3.8.5. The fused lasso was proposed
by Tibshirani et al. (2005), while Zou and Hastie (2005) introduced the
elastic net. Supervised principal components is discussed in Bair and
Tibshirani (2004) and Bair et al. (2006). For an introduction to the analysis
of censored survival data, see Kalbfleisch and Prentice (1980).

Microarray technology has led to a flurry of statistical research: see for 
example the books by Speed (2003), Parmigiani et al. (2003), Simon et al. 
(2004), and Lee (2004). 

The false discovery rate was proposed by Benjamini and Hochberg (1995),
and studied and generalized in subsequent papers by these authors and
many others. A partial list of papers on FDR may be found on Yoav
Benjamini’s homepage. Some more recent papers include Efron and Tibshirani
(2002), Storey (2002), Genovese and Wasserman (2004), Storey and
Tibshirani (2003) and Benjamini and Yekutieli (2005). Dudoit et al. (2002b)
review methods for identifying differentially expressed genes in microarray
studies.


Exercises 


Ex. 18.1 For a coefficient estimate $\hat{\beta}$, let
$\hat{\beta}/\|\hat{\beta}\|_2$ be the normalized version. Show that as
$\lambda \to \infty$, the normalized ridge-regression estimates converge to the
renormalized partial-least-squares one-component estimates.

Ex. 18.2 Nearest shrunken centroids and the lasso. Consider a (naive Bayes)
Gaussian model for classification in which the features $j = 1, 2, \ldots, p$
are assumed to be independent within each class $k = 1, 2, \ldots, K$. With
observations $i = 1, 2, \ldots, N$ and $C_k$ equal to the set of indices of the
$N_k$ observations in class $k$, we observe
$x_{ij} \sim N(\mu_j + \mu_{jk}, \sigma_j^2)$ for $i \in C_k$ with
$\sum_{k=1}^{K} \mu_{jk} = 0$. Set $\hat{\sigma}_j^2 = s_j^2$, the pooled
within-class variance for feature $j$, and consider the lasso-style
minimization problem

$$\min_{\{\mu_j, \mu_{jk}\}} \left\{ \frac{1}{2} \sum_{j=1}^{p} \sum_{k=1}^{K} \sum_{i \in C_k}
\frac{(x_{ij} - \mu_j - \mu_{jk})^2}{s_j^2}
+ \lambda \sum_{j=1}^{p} \sum_{k=1}^{K} \sqrt{N_k}\, \frac{|\mu_{jk}|}{s_j} \right\}. \tag{18.55}$$

Show that the solution is equivalent to the nearest shrunken centroid
estimator (18.5), with $s_0$ set to zero, and $M_k$ equal to $1/N_k$ instead of
$1/N_k - 1/N$ as before.

Ex. 18.3 Show that the fitted coefficients for the regularized multiclass
logistic regression problem (18.10) satisfy
$\sum_{k=1}^{K} \hat{\beta}_{kj} = 0$, $j = 1, \ldots, p$.
What about the $\hat{\beta}_{k0}$? Discuss issues with these constant
parameters, and how they can be resolved.

Ex. 18.4 Derive the computational formula (18.15) for ridge regression.
[Hint: Use the first derivative of the penalized sum-of-squares criterion to
show that if $\lambda > 0$, then $\hat{\beta} = X^T s$ for some
$s \in \mathbb{R}^N$.]

Ex. 18.5 Prove the theorem (18.16)-(18.17) in Section 18.3.5, by
decomposing $\beta$ and the rows of $X$ into their projections into the column
space of $V$ and its complement in $\mathbb{R}^p$.

Ex. 18.6 Show how the theorem in Section 18.3.5 can be applied to
regularized discriminant analysis [Section 4.14 and Equation (18.9)].

Ex. 18.7 Consider a linear regression problem where $p \gg N$, and assume
the rank of $X$ is $N$. Let the SVD of $X = UDV^T = RV^T$, where $R$ is
$N \times N$ nonsingular, and $V$ is $p \times N$ with orthonormal columns.

(a) Show that there are infinitely many least-squares solutions all with
zero residuals.

(b) Show that the ridge-regression estimate for $\beta$ can be written

$$\hat{\beta}_\lambda = V(R^T R + \lambda I)^{-1} R^T y. \tag{18.56}$$

(c) Show that when $\lambda = 0$, the solution
$\hat{\beta}_0 = VD^{-1}U^T y$ has residuals all equal to zero, and is unique in
that it has the smallest Euclidean norm amongst all zero-residual solutions.

Ex. 18.8 Data Piling. Exercise 4.2 shows that the two-class LDA solution
can be obtained by a linear regression of a binary response vector $y$
consisting of $-1$s and $+1$s. The prediction $\hat{\beta}^T x$ for any $x$ is
(up to a scale and shift) the LDA score $\delta(x)$. Suppose now that
$p \gg N$.

(a) Consider the linear regression model $f(x) = \alpha + \beta^T x$ fit to a
binary response $Y \in \{-1, +1\}$. Using Exercise 18.7, show that there are
infinitely many directions defined by $\hat{\beta}$ in $\mathbb{R}^p$ onto which
the data project to exactly two points, one for each class. These are known as
data piling directions (Ahn and Marron, 2005).

(b) Show that the distance between the projected points is
$2/\|\hat{\beta}\|$, and hence these directions define separating hyperplanes
with that margin.

(c) Argue that there is a single maximal data piling direction for which
this distance is largest, and is defined by
$\hat{\beta}_0 = VD^{-1}U^T y = X^{-} y$, where $X = UDV^T$ is the SVD of $X$.

Ex. 18.9 Compare the data piling direction of Exercise 18.8 to the direction 
of the optimal separating hyperplane (Section 4.5.2) qualitatively. Which 
makes the widest margin, and why? Use a small simulation to demonstrate 
the difference. 

Ex. 18.10 When $p \gg N$, linear discriminant analysis (see Section 4.3) is
degenerate because the within-class covariance matrix $W$ is singular. One
version of regularized discriminant analysis (4.14) replaces $W$ by a ridged
version $W + \lambda I$, leading to a regularized discriminant function
$\delta_\lambda(x) = x^T (W + \lambda I)^{-1} (\bar{x}_1 - \bar{x}_{-1})$. Show
that $\delta_0(x) = \lim_{\lambda \downarrow 0} \delta_\lambda(x)$ corresponds
to the maximal data piling direction defined in Exercise 18.8.

Ex. 18.11 Suppose you have a sample of $N$ pairs $(x_i, y_i)$, with $y_i$
binary and $x_i \in \mathbb{R}^1$. Suppose also that the two classes are
separable; e.g., for each pair $i, i'$ with $y_i = 0$ and $y_{i'} = 1$,
$x_{i'} - x_i \ge C$ for some $C > 0$. You wish to fit a linear logistic
regression model $\mathrm{logit}\,\Pr(Y = 1 \mid X) = \alpha + \beta X$ by
maximum-likelihood. Show that $\hat{\beta}$ is undefined.

Ex. 18.12 Suppose we wish to select the ridge parameter $\lambda$ by 10-fold
cross-validation in a $p \gg N$ situation (for any linear model). We wish to use
the computational shortcuts described in Section 18.3.5. Show that we need
only to reduce the $N \times p$ matrix $X$ to the $N \times N$ matrix $R$ once,
and can use it in all the cross-validation runs.

Ex. 18.13 Suppose our $p \gg N$ predictors are presented as an $N \times N$
inner-product matrix $K = XX^T$, and we wish to fit the equivalent of a linear
logistic regression model in the original features with quadratic
regularization. Our predictions are also to be made using inner products; a new
$x_0$ is presented as $k_0 = Xx_0$. Let $K = UD^2U^T$ be the eigen-decomposition
of $K$. Show that the predictions are given by
$\hat{f}_0 = k_0^T \hat{\alpha}$, where

(a) $\hat{\alpha} = UD^{-1}\hat{\beta}$, and

(b) $\hat{\beta}$ is the ridged logistic regression estimate with input matrix
$R = UD$.

Argue that the same approach can be used for any appropriate kernel
matrix $K$.

Ex. 18.14 Distance weighted 1-NN classification. Consider the
1-nearest-neighbor method (Section 13.3) in a two-class classification problem.
Let $d_+(x_0)$ be the shortest distance to a training observation in class $+1$,
and likewise $d_-(x_0)$ the shortest distance for class $-1$. Let $N_-$ be the
number of samples in class $-1$, $N_+$ the number in class $+1$, and
$N = N_- + N_+$.

(a) Show that

$$\delta(x_0) = \log \frac{d_-(x_0)}{d_+(x_0)} \tag{18.57}$$

can be viewed as a nonparametric discriminant function corresponding to 1-NN
classification. [Hint: Show that
$\hat{f}_+(x_0) = \frac{1}{N_+ d_+(x_0)}$ can be viewed as a nonparametric
estimate of the density in class $+1$ at $x_0$.]

(b) How would you modify this function to introduce class prior
probabilities $\pi_+$ and $\pi_-$ different from the sample-priors $N_+/N$ and
$N_-/N$?

(c) How would you generalize this approach for K-NN classification?

Ex. 18.15 Kernel PCA. In Section 18.5.2 we show how to compute the
principal component variables $Z$ from an uncentered inner-product matrix
$K$. We compute the eigen-decomposition $(I - M)K(I - M) = UD^2U^T$,
with $M = \mathbf{1}\mathbf{1}^T/N$, and then $Z = UD$. Suppose we have the
inner-product vector $k_0$, containing the $N$ inner-products between a new
point $x_0$ and each of the $x_i$ in our training set. Show that the (centered)
projections of $x_0$ onto the principal-component directions are given by

$$z_0 = D^{-1}U^T(I - M)\,[k_0 - K\mathbf{1}/N]. \tag{18.58}$$

Ex. 18.16 Bonferroni method for multiple comparisons. Suppose we are in
a multiple-testing scenario with null hypotheses $H_{0j}$,
$j = 1, 2, \ldots, M$, and corresponding p-values $p_j$,
$j = 1, 2, \ldots, M$. Let $A$ be the event that at least one null hypothesis is
falsely rejected, and let $A_j$ be the event that the $j$th null hypothesis is
falsely rejected. Suppose that we use the Bonferroni method, rejecting the
$j$th null hypothesis if $p_j < \alpha/M$.

(a) Show that $\Pr(A) \le \alpha$. [Hint:
$\Pr(A_j \cup A_{j'}) = \Pr(A_j) + \Pr(A_{j'}) - \Pr(A_j \cap A_{j'})$.]

(b) If the hypotheses $H_{0j}$, $j = 1, 2, \ldots, M$, are independent, then
$\Pr(A) = 1 - \Pr(A^C) = 1 - \prod_{j=1}^{M} \Pr(A_j^C) = 1 - (1 - \alpha/M)^M$.
Use this to show that $\Pr(A) \approx \alpha$ in this case.


Ex. 18.17 Equivalence between Benjamini-Hochberg and plug-in methods.

(a) In the notation of Algorithm 18.2, show that for rejection threshold
$p_0 = p_{(L)}$, a proportion of at most $p_0$ of the permuted values $t_j^k$
exceed $|T|_{(L)}$, where $|T|_{(L)}$ is the $L$th largest value among the
$|t_j|$. Hence show that the plug-in FDR estimate $\widehat{\mathrm{FDR}}$ is
less than or equal to $p_0 \cdot M/L = \alpha$.

(b) Show that the cut-point $|T|_{(L+1)}$ produces a test with estimated FDR
greater than $\alpha$.

Ex. 18.18 Use result (18.53) to show that

$$\mathrm{pFDR} = \frac{\pi_0 \cdot \{\text{Type I error of } \Gamma\}}
{\pi_0 \cdot \{\text{Type I error of } \Gamma\} + \pi_1 \cdot \{\text{Power of } \Gamma\}} \tag{18.59}$$

(Storey, 2003).

Ex. 18.19 Consider the data in Table 18.4 of Section (18.7), available from 
the book website. 


(a) Using a symmetric two-sided rejection region based on the t-statistic, 

compute the plug-in estimate of the FDR for various values of the 
cut-point. 

(b) Carry out the BH procedure for various FDR levels a and show the 
equivalence of your results, with those from part (a). 
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(c) Let (q. 25 , 9 . 75 ) be the quartiles of the f-statistics from the permuted 

datasets. Let 7r 0 = {#tj € ( 9 . 25 ,<7.75)}/(-5M), and set 7r 0 = min(7fo, 1). 
Multiply the FDR estimates from (a) by fro and examine the results. 

(d) Give a motivation for the estimate in part (c). 

(Storey, 2003) 
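For part (b), the BH step-up rule can be sketched as follows. This is a minimal illustration that takes a plain list of p-values; it uses the strict inequality $p_{(j)} < \alpha \cdot j/M$, which is assumed here to match Algorithm 18.2 (some presentations use $\le$). The function name and the example p-values are hypothetical and are not taken from Table 18.4.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    # L = largest rank j (1-based) with p_(j) < alpha * j / m; 0 if there is none.
    largest_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] < alpha * rank / m:
            largest_rank = rank
    # Reject every hypothesis whose p-value is among the L smallest.
    return sorted(order[:largest_rank])


# Made-up p-values, for illustration only
p = [0.035, 0.039, 0.025, 0.005, 0.5]
print(benjamini_hochberg(p, alpha=0.05))
```

In this example the second- and third-smallest p-values fail their own thresholds, yet they are still rejected because the fourth-smallest passes its threshold. That is the step-up behaviour that separates BH from the Bonferroni rule of Ex. 18.16, which would reject only the hypothesis with p-value 0.005.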

Ex. 18.20 Proof of result (18.53). Write

$$\mathrm{pFDR} = E\left(\frac{V}{R} \,\Big|\, R > 0\right) \tag{18.60}$$

$$= \sum_{k=1}^{M} E\left[\frac{V}{R} \,\Big|\, R = k\right] \Pr(R = k \mid R > 0) \tag{18.61}$$

Use the fact that given $R = k$, $V$ is a binomial random variable, with $k$
trials and probability of success $\Pr(H = 0 \mid T \in \Gamma)$, to complete the proof.

microarray, 5, 505, 532
nested spheres, 590 
New Zealand fish, 375-379 
nuclear magnetic resonance, 
176 

ozone, 201 

prostate cancer, 3, 49, 61, 608 
protein mass spectrometry, 664 
satellite image, 470 
skin of the orange, 429-432 
spam, 2, 300-304, 313, 320, 
328, 352, 593 
vowel, 440, 464 
waveform, 451 
ZIP code, 4, 404, 536-539 
Archetypal analysis, 554-557 
Association rules, 492-495, 499-501
Automatic relevance determination, 411
Automatic selection of smoothing parameters, 156

B-Spline, 186 

Back-propagation, 392-397, 408-409

Backfitting, 297, 391 
Backward 

selection, 58 
stepwise selection, 59 
Backward pass, 396 
Bagging, 282-288, 409, 587 
Basis expansions and regularization, 139-189
Basis functions, 141, 186, 189, 321, 328

Batch learning, 397 
Baum-Welch algorithm, 272 
Bayes 

classifier, 21 
factor, 234 

methods, 233-235, 267-272 
rate, 21 
Bayesian, 409 

Bayesian information criterion (BIC), 
233 

Benjamini-Hochberg method, 688 
Best-subset selection, 57, 610 
Between class covariance matrix, 114
Bias, 16, 24, 37, 160, 219
Bias-variance decomposition, 24, 37, 219
Bias-variance tradeoff, 37, 219
BIC, see Bayesian Information Criterion

Boltzmann machines, 638-648 
Bonferroni method, 686 
Boosting, 337-386, 409 

as lasso regression, 607-609 
exponential loss and AdaBoost, 
343 

gradient boosting, 358 


implementations, 360 
margin maximization, 613 
numerical optimization, 358 
partial-dependence plots, 369 
regularization path, 607 
shrinkage, 364 

stochastic gradient boosting, 365

tree size, 361 
variable importance, 367 
Bootstrap, 249, 261-264, 267, 271-282, 587
relationship to Bayesian method, 271
relationship to maximum likelihood method, 267
Bottom-up clustering, 520-528 
Bump hunting, see Patient rule 
induction method 
Bumping, 290-292 

C5.0, 624 

Canonical variates, 441 
CART, see Classification and regression trees

Categorical predictors, 10, 310 
Censored data, 674 
Classical multidimensional scaling, 
570 

Classification, 22, 101-137, 305-317, 417-429

Classification and regression trees 
(CART), 305-317 
Clique, 628 
Clustering, 501-528 
K-means, 509-510
agglomerative, 523-528 
hierarchical, 520-528 
Codebook, 515 

Combinatorial algorithms, 507 
Combining models, 288-290 
Committee, 289, 587, 605 
Comparison of learning methods, 
350-352 

Complete data, 276
Complexity parameter, 37
Computational shortcuts 
quadratic penalty, 659 
Condensing procedure, 480 
Conditional likelihood, 31 
Confusion matrix, 301 
Conjugate gradients, 396 
Consensus, 285-286 
Convolutional networks, 407 
Coordinate descent, 92, 636, 668 
COSSO, 304 

Cost complexity pruning, 308 
Covariance graph, 631 
$C_p$ statistic, 230
Cross-entropy, 308-310 
Cross-validation, 241-245 
Cubic smoothing spline, 151-153 
Cubic spline, 151-153
Curse of dimensionality, 22-26 

Dantzig selector, 89 
Data augmentation, 276 
Daubechies symmlet-8 wavelets, 
176 

De-correlation, 597 
Decision boundary, 13-15, 21 
Decision trees, 305-317 
Decoder, 515, see encoder 
Decomposable models, 641 
Degrees of freedom 

in an additive model, 302 
in ridge regression, 68 
of a tree, 336 

of smoother matrices, 153-154, 
158 

Delta rule, 397 

Demmler-Reinsch basis for splines, 
156 

Density estimation, 208-215 
Deviance, 124, 309 
Diagonal linear discriminant analysis, 651-654
Dimension reduction, 658 

for nearest neighbors, 479 
Discrete variables, 10, 310-311 


Discriminant 

adaptive nearest neighbor classifier, 475-480
analysis, 106-119 
coordinates, 108 
functions, 109-110 
Dissimilarity measure, 503-504 
Dummy variables, 10 

Early stopping, 398 

Effective degrees of freedom, 17, 68, 153-154, 158, 232, 302, 336
Effective number of parameters, 15, 68, 153-154, 158, 232, 302, 336
Eigenvalues of a smoother matrix, 154

Elastic net, 662 
EM algorithm, 272-279 

as a maximization-maximization 
procedure, 277 
for two component Gaussian 
mixture, 272 
Encoder, 514-515 
Ensemble, 616-623 
Ensemble learning, 605-624 
Entropy, 309 
Equivalent kernel, 156 
Error rate, 219-230 
Error-correcting codes, 606 
Estimates of in-sample prediction 
error, 230 

Expectation-maximization algorithm, 
see EM algorithm 
Extra-sample error, 228 

False discovery rate, 687-690, 692, 693

Feature, 1 

extraction, 150 
selection, 409, 658, 681-683 
Feed-forward neural networks, 392-408
Fisher’s linear discriminant, 106-119, 438
Flexible discriminant analysis, 440-445

Forward 

selection, 58 
stagewise, 86, 608 
stagewise additive modeling, 342

stepwise, 73 

Forward pass algorithm, 395 
Fourier transform, 168 
Frequentist methods, 267 
Function approximation, 28-36 
Fused lasso, 666 

Gap statistic, 519 
Gating networks, 329 
Gauss-Markov theorem, 51-52
Gauss-Newton method, 391 
Gaussian (normal) distribution, 16 
Gaussian graphical model, 630 
Gaussian mixtures, 273, 463, 492, 509
Gaussian radial basis functions, 212

GBM, see Gradient boosting 
GBM package, see Gradient boosting

GCV, see Generalized cross-validation 
GEM (generalized EM), 277 
Generalization 
error, 220 
performance, 220 
Generalized additive model, 295-304
Generalized association rules, 497-499
Generalized cross-validation, 244
Generalized linear discriminant analysis, 438

Generalized linear models, 125 
Gibbs sampler, 279-280, 641 
for mixtures, 280 
Gini index, 309 


Global Markov property, 628 
Gradient Boosting, 359-361 
Gradient descent, 358, 395-397 
Graph Laplacian, 545 
Graphical lasso, 636 
Grouped lasso, 90 

Haar basis function, 176 
Hammersley-Clifford theorem, 629 
Hard-thresholding, 653 
Hat matrix, 46 
Helix, 582 
Hessian matrix, 121 
Hidden nodes, 641-642 
Hidden units, 393-394 
Hierarchical clustering, 520-528 
Hierarchical mixtures of experts, 
329-332 

High-dimensional problems, 649 
Hints, 96 

Hyperplane, see Separating Hyperplane

ICA, see Independent components 
analysis 

Importance sampling, 617 
In-sample prediction error, 230 
Incomplete data, 332 
Independent components analysis, 
557-570 

Independent variables, 9 
Indicator response matrix, 103 
Inference, 261-294 
Information 
Fisher, 266 
observed, 274 

Information theory, 236, 561 
Inner product, 53, 668, 670 
Inputs, 10 

Instability of trees, 312 
Intercept, 11 
Invariance manifold, 471 
Invariant metric, 471 
Inverse wavelet transform, 179
IRLS, see Iteratively reweighted least squares
Irreducible error, 224 
Ising model, 638 
ISOMAP, 572 

Isometric feature mapping, 572 
Iterative proportional scaling, 585 
Iteratively reweighted least squares 
(IRLS), 121 

Jensen’s inequality, 293 
Join tree, 629 
Junction tree, 629 

K-means clustering, 460, 509-514 
K-medoid clustering, 515-520 
K-nearest neighbor classifiers, 463 
Karhunen-Loève transformation (principal components), 66-67, 79, 534-539
Karush-Kuhn-Tucker conditions, 133, 420

Kernel 

classification, 670 
density classification, 210 
density estimation, 208-215 
function, 209 
logistic regression, 654 
principal component, 547-550 
string, 668-669 
trick, 660 

Kernel methods, 167-176, 208-215, 
423-438, 659 
Knot, 141, 322 
Kriging, 171 

Kruskal-Shephard scaling, 570 
Kullback-Leibler distance, 561 

Lagrange multipliers, 293 
Landmark, 539 
Laplacian, 545 
Laplacian distribution, 72 
LAR, see Least angle regression 
Lasso, 68-69, 86-90, 609, 635, 636, 661


fused, 666 
Latent 

factor, 674 
variable, 678 
Learning, 1 
Learning rate, 396 
Learning vector quantization, 462 
Least angle regression, 73-79, 86, 
610 

Least squares, 11, 32 
Leave-one-out cross-validation, 243 
LeNet, 406 

Likelihood function, 265, 273 
Linear basis expansion, 139-148 
Linear combination splits, 312 
Linear discriminant function, 106-119
Linear methods
for classification, 101-137
for regression, 43-99
Linear models and least squares, 11

Linear regression of an indicator 
matrix, 103 

Linear separability, 129 
Linear smoother, 153 
Link function, 296 
LLE, see Local linear embedding 
Local false discovery rate, 693 
Local likelihood, 205 
Local linear embedding, 572 
Local methods in high dimensions, 
22-27 

Local minima, 400 
Local polynomial regression, 197 
Local regression, 194, 200 
Localization in time/frequency, 175 
Loess (local regression), 194, 200 
Log-linear model, 639 
Log-odds ratio (logit), 119 
Logistic (sigmoid) function, 393 
Logistic regression, 119-128, 299 
Logit (log-odds ratio), 119 
Loss function, 18, 21, 219-223, 346 
Loss matrix, 310
Lossless compression, 515
Lossy compression, 515
LVQ, see Learning Vector Quantization

Mahalanobis distance, 441 
Majority vote, 337 
Majorization, 294, 553 
Majorize-Minimize algorithm, 294, 
584 

MAP (maximum a posteriori) estimate, 270
Margin, 134, 418 
Market basket analysis, 488, 499 
Markov chain Monte Carlo (MCMC) 
methods, 279 
Markov graph, 627 
Markov networks, 638-648 

MARS, see Multivariate adaptive regression splines
MART, see Multiple additive regression trees
Maximum likelihood estimation, 31, 261, 265

MCMC, see Markov Chain Monte 
Carlo Methods 

MDL, see Minimum description 
length 

Mean field approximation, 641 
Mean squared error, 24, 285 
Memory-based method, 463 
Metropolis-Hastings algorithm, 282 
Minimum description length (MDL), 
235 

Minorization, 294, 553 
Minorize-Maximize algorithm, 294, 
584 

Misclassification error, 17, 309 
Missing data, 276, 332-333 
Missing predictor values, 332-333 
Mixing proportions, 214 
Mixture discriminant analysis, 449-455
Mixture modeling, 214-215, 272-275, 449-455, 692


Mixture of experts, 329-332 
Mixtures and the EM algorithm, 
272-275 

MM algorithm, 294, 584 
Mode seekers, 507 
Model averaging and stacking, 288 
Model combination, 289 
Model complexity, 221-222 
Model selection, 57, 222-223, 230-231

Modified regression, 634 
Monte Carlo method, 250, 495 
Mother wavelet, 178 
Multidimensional scaling, 570-572 
Multidimensional splines, 162 
Multiedit algorithm, 480 
Multilayer perceptron, 400, 401 
Multinomial distribution, 120 
Multiple additive regression trees 
(MART), 361 

Multiple hypothesis testing, 683-693

Multiple minima, 291, 400 
Multiple outcome shrinkage and 
selection, 84 

Multiple outputs, 56, 84, 103-106 
Multiple regression from simple univariate regression, 52
Multiresolution analysis, 178
Multivariate adaptive regression splines (MARS), 321-327
Multivariate nonparametric regression, 445

Nadaraya-Watson estimate, 193 
Naive Bayes classifier, 108, 210-211, 694
Natural cubic splines, 144-146
Nearest centroids, 670
Nearest neighbor methods, 463-483
Nearest shrunken centroids, 651-654, 694

Network diagram, 392 
Neural networks, 389-416
Newton’s method (Newton-Raphson procedure), 120-122
Non-negative matrix factorization, 553-554
Nonparametric logistic regression, 299-304
Normal (Gaussian) distribution, 16, 31

Normal equations, 12 
Numerical optimization, 395-396 

Object dissimilarity, 505-507 
Online algorithm, 397 
Optimal scoring, 445, 450-451 
Optimal separating hyperplane, 132-135

Optimism of the training error rate, 
228-230 

Ordered categorical (ordinal) predictor, 10, 504
Ordered features, 666 
Orthogonal predictors, 53 
Overfitting, 220, 228-230, 364 

PageRank, 576 
Pairwise distance, 668 
Pairwise Markov property, 628 
Parametric bootstrap, 264 
Partial dependence plots, 369-370 
Partial least squares, 80-82, 680 
Partition function, 638 
Parzen window, 208 
Pasting, 318 

Path algorithm, 73-79, 86-89, 432 
Patient rule induction method (PRIM), 317-321, 499-501
Peeling, 318
Penalization, 607, see regularization
Penalized discriminant analysis, 446-449
Penalized polynomial regression, 171

Penalized regression, 34, 61-69, 171 
Penalty matrix, 152, 189 


Perceptron, 392-416 
Piecewise polynomials and splines, 
36, 143 

Posterior 

distribution, 268 
probability, 233-235, 268 
Power method, 577 
Pre-conditioning, 681-683 
Prediction accuracy, 329 
Prediction error, 18 
Predictive distribution, 268 
PRIM, see Patient rule induction 
method 

Principal components, 66-67, 79-80, 534-539, 547
regression, 79-80 
sparse, 550 
supervised, 674 

Principal curves and surfaces, 541-544

Principal points, 541 
Prior distribution, 268-272 
Procrustes 

average, 540 
distance, 539 

Projection pursuit, 389-392, 565 
regression, 389-392 
Prototype classifier, 459-463 
Prototype methods, 459-463 
Proximity matrices, 503 
Pruning, 308 

QR decomposition, 55 
Quadratic approximations and inference, 124

Quadratic discriminant function, 
108, 110 

Radial basis function (RBF) network, 392

Radial basis functions, 212-214, 
275, 393 
Radial kernel, 548 
Random forest, 409, 587-604 
algorithm, 588
bias, 596-601

comparison to boosting, 589 
example, 589 
out-of-bag (oob), 592 
overfit, 596 
proximity plot, 595 
variable importance, 593 
variance, 597-601 
Rao score test, 125 
Rayleigh quotient, 116 
Receiver operating characteristic 
(ROC) curve, 317 
Reduced-rank linear discriminant 
analysis, 113 

Regression, 11-14, 43-99, 200-204 
Regression spline, 144 
Regularization, 34, 167-176 
Regularized discriminant analysis, 
112-113, 654 
Relevance network, 631 
Representer of evaluation, 169 
Reproducing kernel Hilbert space, 
167-176, 428-429 
Reproducing property, 169 
Responsibilities, 274-275 
Ridge regression, 61-68, 650, 659 
Risk factor, 122 
Robust fitting, 346-350 
Rosenblatt’s perceptron learning 
algorithm, 130 
Rug plot, 303 
Rulefit, 623 

SAM, 690-693, see Significance Analysis of Microarrays
Sammon mapping, 571 
SCAD, 92 

Scaling of the inputs, 398 
Schwarz’s criterion, 230-235 
Score equations, 120, 265 
Self-consistency property, 541-543
Self-organizing map (SOM), 528-534

Sensitivity of a test, 314-317 
Separating hyperplane, 132-135 


Separating hyperplanes, 136, 417-419

Separator, 628 
Shape average, 482, 540 
Shrinkage methods, 61-69, 652 
Sigmoid, 393 

Significance Analysis of Microarrays, 690-693
Similarity measure, see Dissimilarity measure
Single index model, 390 
Singular value decomposition, 64, 
535-536, 659 
singular values, 535 
singular vectors, 535 
Sliced inverse regression, 480 
Smoother, 139-156, 192-199 
matrix, 153 

Smoothing parameter, 37, 156-161, 
198-199 

Smoothing spline, 151-156 
Soft clustering, 512 
Soft-thresholding, 653 
Softmax function, 393 
SOM, see Self-organizing map 
Sparse, 175, 304, 610-613, 636 
additive model, 91 
graph, 625, 635 
Specificity of a test, 314-317 
Spectral clustering, 544-547 
Spline, 186 

additive, 297-299 
cubic, 151-153 
cubic smoothing, 151-153 
interaction, 428 
regression, 144 
smoothing, 151-156
thin plate, 165 

Squared error loss, 18, 24, 37, 219 
SRM, see Structural risk minimization

Stacking (stacked generalization), 
290 

Starting values, 397 
Statistical decision theory, 18-22
Statistical model, 28-29
Steepest descent, 358, 395-397 
Stepwise selection, 60 
Stochastic approximation, 397 
Stochastic search (bumping), 290-292

Stress function, 570-572 
Structural risk minimization (SRM), 
239-241 

Subset selection, 57-60 
Supervised learning, 2 
Supervised principal components, 
674-681 

Support vector classifier, 417-421, 
654 

multiclass, 657 

Support vector machine, 423-437 
SURE shrinkage method, 179 
Survival analysis, 674 
Survival curve, 674 
SVD, see Singular value decomposition

Symmlet basis, 176 

Tangent distance, 471-475 
Tanh activation function, 424 
Target variables, 10 
Tensor product basis, 162 
Test error, 220-223 
Test set, 220 
Thin plate spline, 165 
Thinning strategy, 189 
Trace of a matrix, 153 
Training epoch, 397 
Training error, 220-223 
Training set, 219-223 
Tree for regression, 307-308
Tree-based methods, 305-317 
Trees for classification, 308-310 
Trellis display, 202 


Undirected graph, 625-648 
Universal approximator, 390 
Unsupervised learning, 2, 485-585 
Unsupervised learning as supervised learning, 495-497

Validation set, 222 
Vapnik-Chervonenkis (VC) dimension, 237-239

Variable importance plot, 594 
Variable types and terminology, 9 
Variance, 16, 25, 37, 158-161, 219 
between, 114 
within, 114, 446 
Variance reduction, 588 
Varying coefficient models, 203-204
VC dimension, see Vapnik-Chervonenkis dimension
Vector quantization, 514-515 
Voronoi regions, 510 

Wald test, 125 
Wavelet 

basis functions, 176-179 
smoothing, 174 
transform, 176-179 
Weak learner, 383, 605 
Weakest link pruning, 308 
Webpages, 576 
Website for book, 8 
Weight decay, 398 
Weight elimination, 398 
Weights in a neural network, 395 
Within class covariance matrix, 114, 
446 


